Abr*_*ial 12 pipeline numpy python-3.x scikit-learn neuraxle
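I am using scikit-learn's Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to a document of a different length. My goal is to compute the top tfidf for each document independently, but I keep getting this error message:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.
2000 is the size of the training data. Here is the main code: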
book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])
book_contents = Pipeline([('selector3', book_content_count())])
ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced'))  # classifier with cross fold 5
])
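I wrote two classes to handle each pipeline feature. My problem is with the book_contents pipeline, which processes each sample and returns the TF-IDF matrix for each book independently.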
class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        corpus = (str([book_data['text']]).strip('[]'))
        return corpus

    def transform(self, data_dict, y=None):
        data_dict['bookid']  # from here take the name
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self
Data sample (example):
title Summary bookid
The beauty and the beast is a traditional fairy tale... 10
ocean at the end of the lane is a 2013 novel by British 11
Each id then refers to a text file that contains the actual content of these books.
I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this? Thanks.
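For context, scikit-learn's FeatureUnion joins the outputs of its transformers column-wise (with scipy.sparse.hstack when any output is sparse), so every transformer has to return one row per input sample. The following is a minimal sketch that reproduces this kind of row mismatch outside of any pipeline. The matrix sizes and variable names are made up for illustration, and the exact wording of the error message depends on the SciPy version.

```python
import numpy as np
from scipy import sparse

n_samples = 2000  # same size as the training set in the question

# Stand-in for the TF-IDF features of the summaries: one row per sample.
summary_features = sparse.random(n_samples, 50, density=0.01, format="csr")

# A block with a single row, like the one the error message complains about.
bad_block = sparse.csr_matrix(np.ones((1, 5)))
# A block with one row per sample stacks without problems.
good_block = sparse.csr_matrix(np.ones((n_samples, 5)))

try:
    sparse.hstack([summary_features, bad_block])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

stacked = sparse.hstack([summary_features, good_block])
print(error_message)
print(stacked.shape)
```

If the second transformer returns a matrix whose first dimension is not the number of samples, the union fails in the same way, regardless of toarray or reshape calls made afterwards.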
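You can use Neuraxle's FeatureUnion together with a custom joiner that you need to code yourself. The joiner is a class passed to Neuraxle's FeatureUnion to merge the results together the way you expect.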
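import numpy as np  # used by CustomJoiner
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC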
class BookContentCount(BaseStep):
    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self
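The `do_things(...)` placeholder above is left open. One point worth keeping in mind when filling it in: a vectorizer that is fitted inside `transform` learns a new vocabulary on every call, so the output columns for training and test data would not line up. Below is a minimal sketch in plain scikit-learn (not Neuraxle) that fits once in `fit` and reuses the vocabulary in `transform`. The class name `BookContentTfidf`, the `load_text` callable (which maps a book id to that book's text), and the tiny in-memory corpus are all hypothetical and only serve the illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class BookContentTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: returns one TF-IDF row per book id."""

    def __init__(self, load_text):
        # load_text is an assumed helper: callable taking a book id, returning its text.
        self.load_text = load_text

    def fit(self, book_ids, y=None):
        texts = [self.load_text(b) for b in book_ids]
        # Learn the vocabulary once, on the training books only.
        self.vectorizer_ = TfidfVectorizer(lowercase=False, stop_words="english")
        self.vectorizer_.fit(texts)
        return self

    def transform(self, book_ids):
        texts = [self.load_text(b) for b in book_ids]
        # Reuse the fitted vocabulary: len(book_ids) rows, fixed column count.
        return self.vectorizer_.transform(texts)


# Made-up corpus standing in for the per-book files.
books = {
    10: "beauty and the beast fairy tale",
    11: "ocean at the end of the lane novel",
    12: "a novel about a beast",
}
step = BookContentTfidf(load_text=books.get)
train = step.fit_transform([10, 11])
test = step.transform([12])
print(train.shape, test.shape)
```

In the question's setting, the ids would come from the `bookid` column and `load_text` would read the corresponding file; those details are not shown here.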
class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
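As a possible starting point for the TODO above: TfidfVectorizer returns SciPy sparse matrices, while np.concatenate is meant for dense arrays, so the concatenation may need to handle both. The sketch below is one way to do that under the assumption that every block is 2-D with one row per sample. `concatenate_features` is a hypothetical helper name, not part of Neuraxle or scikit-learn.

```python
import numpy as np
from scipy import sparse


def concatenate_features(blocks):
    """Join per-step outputs column-wise; all blocks need the same row count."""
    n_rows = {b.shape[0] for b in blocks}
    if len(n_rows) != 1:
        raise ValueError(f"row counts differ across blocks: {sorted(n_rows)}")
    if any(sparse.issparse(b) for b in blocks):
        # Sparse (or mixed sparse/dense) inputs: stack sparsely, return CSR.
        return sparse.hstack(blocks).tocsr()
    # All-dense inputs: plain NumPy concatenation along the last axis.
    return np.concatenate(blocks, axis=-1)


tfidf_block = sparse.csr_matrix(np.eye(3))   # stand-in for TF-IDF output
count_block = np.arange(6).reshape(3, 2)     # stand-in for a dense feature block
joined = concatenate_features([tfidf_block, count_block])
print(joined.shape)
```

The explicit row-count check fails early with a readable message when one step returns the wrong number of rows, which is the situation described in the question.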
book_summary = Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])
p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ],
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])
Note: if you want to turn your Neuraxle pipeline back into a scikit-learn pipeline, you can do p = p.tosklearn().
Learn more about Neuraxle: https://github.com/Neuraxio/Neuraxle
More examples in the documentation: https://www.neuraxle.org/stable/examples/index.html