How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Question

Kev*_*ham 16 python machine-learning scikit-learn imputation countvectorizer

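I have a pandas DataFrame that includes a column of text, and I want to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer: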

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. Here is the error message:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
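
The mismatch can be seen by inspecting the imputer's output directly. The following is a minimal sketch, not part of the original question; it builds the column with an explicit `dtype=object` (an assumption, made so the example behaves the same across pandas versions) and looks at what CountVectorizer would receive as its first "document".

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same data as in the question; dtype=object is an assumption for portability.
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

# SimpleImputer only accepts 2D input and always returns a 2D array.
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # one row per sample, one column

# CountVectorizer iterates over its input and calls .lower() on each item.
# Iterating over a 2D array yields 1D arrays (rows), not strings.
first_doc = next(iter(imputed))
print(type(first_doc), hasattr(first_doc, 'lower'))
```

Each row is itself a NumPy array, which has no `lower` method, hence the AttributeError.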

Question: How can I modify the Pipeline so that it works?

Note: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that I can use a Pipeline.

Answer 1

Kev*_*ham 14

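The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here is the complete code: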

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

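# CREATE TRANSFORMER: flatten the (n_samples, 1) array to shape (n_samples,)
# np.ravel avoids the `newshape` keyword of np.reshape, deprecated in NumPy 2.1
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.ravel)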

# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!

Answer 2

Ara*_*adi 6

One solution is to subclass SimpleImputer and override its transform() method:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

The reverse also works: `class ModifiedCountVectorizer(CountVectorizer): def fit_transform(self, X, y=None): return super().fit_transform(X.flatten())` (2 upvotes)
