How to combine features with different output dimensions using scikit-learn

Question

Abr*_*ial 12 pipeline numpy python-3.x scikit-learn neuraxle

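I am using scikit-learn's Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to a document, and the documents have different lengths. My goal is to compute the top tf-idf terms for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. Here is the main code: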

book_summary= Pipeline([
   ('selector', ItemSelector(key='book')),
   ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])

book_contents= Pipeline([('selector3', book_content_count())]) 

ppl = Pipeline([
    ('feats', FeatureUnion([
         ('book_summary', book_summary),
         ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced') ) # classifier with cross fold 5
])


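I wrote two classes to handle the pipeline branches. My problem is with the book_contents pipeline, which processes each sample and returns the tf-idf matrix for each book independently.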

class book_content_count():
    def count_contents2(self, bookid):
        # Read the CSV file holding this book's contents and return its text as one string.
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        corpus = (str([book_data['text']]).strip('[]'))
        return corpus

    def transform(self, data_dict, y=None):
        # The book id in data_dict['bookid'] names the file to read.
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self


Data sample (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11


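Each id then refers to a text file that contains the actual content of that book.

I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this? Thanks.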

Answer 1

Gui*_*ier 1

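You can use Neuraxle's FeatureUnion together with a custom joiner that you code yourself. The joiner is a class passed to Neuraxle's FeatureUnion that merges the branch results together the way you expect.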
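For context on the error in the question: as far as I know, scikit-learn's FeatureUnion joins its branch outputs side by side, using scipy.sparse.hstack when any of them is sparse, and that call requires every block to have the same number of rows. The following is a minimal sketch that reproduces the situation in isolation. The matrix sizes and the helper name `try_hstack` are made up for illustration and are not taken from the question.

```python
import numpy as np
from scipy import sparse


def try_hstack(blocks):
    """Return the horizontally stacked matrix, or the error text if stacking fails."""
    try:
        return sparse.hstack(blocks).tocsr()
    except ValueError as exc:
        return str(exc)


# Hypothetical shapes: 4 samples stand in for the 2000 training rows.
summary_features = sparse.csr_matrix(np.ones((4, 3)))  # one row per sample
content_features = sparse.csr_matrix(np.ones((1, 2)))  # a single row for the whole batch

# Row counts differ (4 vs 1), so stacking fails with a ValueError.
mismatch = try_hstack([summary_features, content_features])

# With one row per sample in both blocks, the columns are simply placed side by side.
aligned = try_hstack([summary_features, sparse.csr_matrix(np.ones((4, 2)))])
```

Here `mismatch` holds the error text and `aligned` is a 4 x 5 sparse matrix. Whichever library does the joining, each branch has to return one row per input sample, or the joiner has to reconcile the shapes itself.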

1. Import Neuraxle's classes.

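import numpy as np  # used by the joiner in step 3
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer  # used in step 4
from sklearn.svm import SVC  # used in step 4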


2. Define your custom class by inheriting from BaseStep:

class BookContentCount(BaseStep): 

    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self

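When filling in transform, note that the class in the question calls fit_transform inside transform, so the vocabulary is rebuilt on every call and the columns at prediction time need not match those seen during training. Below is a minimal sketch of the usual fit/transform split, written as a plain scikit-learn transformer rather than a Neuraxle BaseStep. The class name `BookTextTfidf` and the `load_text` callable (book id to full text) are hypothetical stand-ins for reading the book files.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class BookTextTfidf(TransformerMixin, BaseEstimator):
    """Tf-idf over book texts, returning one row per input book id."""

    def __init__(self, load_text):
        # load_text is a hypothetical callable: book id -> full text of that book.
        self.load_text = load_text

    def fit(self, book_ids, y=None):
        # Learn the vocabulary and idf weights once, on the training books.
        texts = [self.load_text(book_id) for book_id in book_ids]
        self.vectorizer_ = TfidfVectorizer(stop_words="english").fit(texts)
        return self

    def transform(self, book_ids):
        # Reuse the fitted vocabulary; the output has len(book_ids) rows.
        texts = [self.load_text(book_id) for book_id in book_ids]
        return self.vectorizer_.transform(texts)


# Usage with an in-memory dict in place of the files on disk.
books = {
    10: "a traditional fairy tale about a beast",
    11: "a novel about an ocean at the end of a lane",
}
step = BookTextTfidf(load_text=books.get)
features = step.fit([10, 11]).transform([10, 11, 10])
```

The result has three rows, one for each id passed to transform, which is the shape a feature union needs in order to line the branches up.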

3. Create a joiner that joins the results of the feature union the way you want:

class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result

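The TODO in the joiner matters here because tf-idf outputs are SciPy sparse matrices, which, as far as I know, np.concatenate does not join correctly. One possible concatenation method is sketched below. It assumes the joiner receives a list of 2-D blocks, one per branch, which I have not verified against Neuraxle's internals, and the function name `join_feature_blocks` is made up for this example.

```python
import numpy as np
from scipy import sparse


def join_feature_blocks(blocks):
    """Horizontally join 2-D feature blocks that have one row per sample."""
    row_counts = {block.shape[0] for block in blocks}
    if len(row_counts) != 1:
        raise ValueError(
            f"every block needs one row per sample, got row counts {sorted(row_counts)}"
        )
    if any(sparse.issparse(block) for block in blocks):
        # Convert every block to CSR so that dense and sparse blocks can be mixed.
        return sparse.hstack([sparse.csr_matrix(block) for block in blocks]).tocsr()
    return np.hstack(blocks)


# Usage: a dense block and a sparse block with 3 rows each.
dense_block = np.ones((3, 2))
sparse_block = sparse.csr_matrix(np.eye(3))
joined = join_feature_blocks([dense_block, sparse_block])
```

Inside CustomJoiner.transform this could replace the np.concatenate line, for example `result = join_feature_blocks(data_inputs)`. The explicit row-count check turns a shape mismatch into a clearer error message than the one quoted in the question.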

4. Finally, create the pipeline by passing the joiner to the FeatureUnion:

book_summary= Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ], 
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])


Note: if you want the Neuraxle pipeline to become a scikit-learn pipeline again, you can do p = p.tosklearn().

Learn more about Neuraxle: https://github.com/Neuraxio/Neuraxle

More examples in the documentation: https://www.neuraxle.org/stable/examples/index.html

Archived: 7 years, 3 months ago
Views: 798
Last recorded: 5 years, 10 months ago