How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Kev*_*ham 16 python machine-learning scikit-learn imputation countvectorizer

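I have a pandas DataFrame that includes a column of text, and I want to vectorize the text using scikit-learn's CountVectorizer. However, the text contains missing values, so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer: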

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array, whereas CountVectorizer requires 1D input. Here is the error message:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
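To make the shape mismatch concrete, here is a minimal sketch. It uses a plain NumPy object array as a stand-in for df[['text']] (that substitution is my assumption, for brevity). It shows that SimpleImputer returns a 2D array, so iterating over it yields 1-element arrays rather than strings, and CountVectorizer then tries to call .lower() on those arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in for df[['text']]: a single column with one missing value
X = np.array([['abc def'], ['abc ghi'], [np.nan]], dtype=object)

imputed = SimpleImputer(strategy='constant').fit_transform(X)

# The imputer keeps the 2D layout: one row per document, one column
print(imputed.shape)

# Each "document" seen by a downstream step is an ndarray, not a str
first_row = imputed[0]
print(type(first_row), hasattr(first_row, 'lower'))

# Flattening gives the 1D sequence of strings that CountVectorizer expects
flat = imputed.ravel()
print(flat.shape, type(flat[0]))
```

The last two lines hint at the fix: the array only needs to be flattened somewhere between the two steps.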

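Question: How can I modify this Pipeline so that it works?

Note: I am aware that I can impute the missing values in pandas. However, I would like to do all of the preprocessing in scikit-learn so that I can use a Pipeline.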

Kev*_*ham 14

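The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here is the complete code: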

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# CREATE TRANSFORMER
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})

# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()
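As a variation on the code above, the reshaping step can be written with np.ravel instead of np.reshape. This is only a sketch, not part of the original answer. The motivation is that, as far as I know, newer NumPy releases deprecate the `newshape` keyword of np.reshape, and np.ravel needs no keyword arguments at all. The explicit `dtype=object` is my own addition so that the example does not depend on how a given pandas version stores strings:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# dtype=object is an assumption added here to pin the column type
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# np.ravel flattens the (n_samples, 1) array to (n_samples,), no kw_args needed
one_dim = FunctionTransformer(np.ravel)

pipe = make_pipeline(imp, one_dim, vect)
counts = pipe.fit_transform(df[['text']]).toarray()

# Learned vocabulary (including the imputed placeholder) and the count matrix
print(sorted(vect.vocabulary_))
print(counts)
```

The missing row is filled with SimpleImputer's default placeholder for string data, so it shows up as its own token in the vocabulary. Pass `fill_value=''` to the imputer if that extra column is not wanted.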

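It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!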


Ara*_*adi 6

One solution is to create a subclass of SimpleImputer and override its transform() method:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

  • Or the other way around: `class ModifiedCountVectorizer(CountVectorizer): def fit_transform(self, X, y=None): return super().fit_transform(X.flatten())` (2 upvotes)
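The one-liner in the comment above only overrides fit_transform, so calling transform on new data would still receive a 2D array. A slightly fuller sketch of the same idea follows. The class name `ColumnCountVectorizer` is hypothetical, and using np.ravel (rather than `.flatten()`) is my own choice so that non-ndarray input is also accepted:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class ColumnCountVectorizer(CountVectorizer):
    """CountVectorizer that flattens a single-column 2D input (hypothetical helper)."""

    def fit_transform(self, raw_documents, y=None):
        # Flatten (n_samples, 1) to (n_samples,) before vectorizing
        return super().fit_transform(np.ravel(raw_documents), y)

    def transform(self, raw_documents):
        # Apply the same flattening when transforming new data
        return super().transform(np.ravel(raw_documents))


# dtype=object is an assumption added here to pin the column type
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)
new_df = pd.DataFrame({'text': ['ghi', np.nan]}, dtype=object)

pipe = make_pipeline(SimpleImputer(strategy='constant'), ColumnCountVectorizer())

train_counts = pipe.fit_transform(df[['text']]).toarray()
new_counts = pipe.transform(new_df[['text']]).toarray()

print(train_counts)
print(new_counts)
```

Compared with subclassing SimpleImputer, this keeps the imputer's output 2D, which may matter if the imputer is reused elsewhere. The trade-off is that the vectorizer now silently flattens whatever it is given, so it should only be fed a single column.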