Sue*_*imh 5 python pipeline scikit-learn
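I am training a model with sklearn, and my training sequence needs to run two different feature-extraction pipelines.
For some reason, each pipeline fits the data without any problem, and when they are used one after the other they also transform the data without problems.
However, when the first pipeline is called after the second pipeline has been fitted, the first pipeline has changed, which causes a dimension mismatch error.
The code below reproduces the problem (I have simplified it a lot; in reality my two pipelines use different parameters, but this is the minimal reproducible example).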
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
vectorizer = CountVectorizer()
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data1)
print(pipeline1.transform(data1))
# Works fine
pipeline2 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data2)
print(pipeline2.transform(data2))
# Works fine
print(pipeline1.transform(data1))
# ValueError: dimension mismatch
Clearly, fitting "pipeline2" somehow interferes with "pipeline1", but I don't know why. I would like to be able to use both at the same time.
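Because you define vectorizer once, up front, and pass that same object to both pipelines, here is what happens:
You fit the first pipeline: vectorizer is fitted on data1 and produces 4 features, and the SVD step is fitted on that 4-feature output.
You fit the second pipeline: the same vectorizer is refitted on data2 and now produces 6 features.
You call the first pipeline again: its vectorizer step now outputs 6 features, but its SVD step still expects 4, hence the dimension mismatch.
How to verify this: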
vectorizer = CountVectorizer()
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
print(pipeline1.transform(data1).shape)
(3, 4)
# Works fine
pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
print(pipeline2.transform(data2).shape)
(3, 6)
# Works fine
# vectorizer = CountVectorizer()
print(pipeline1.transform(data1).shape)
(3, 6)
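The shape change can also be traced directly to the shared object. The following is a minimal sketch, assuming the same toy data as above; the names shared, vocab_after_first_fit, vocab_after_second_fit and same_object are illustrative and not part of the original code. It checks that both pipelines hold the very same CountVectorizer instance and records how its vocabulary is overwritten by the second fit.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# One vectorizer instance reused in both pipelines (the problematic setup).
shared = CountVectorizer()

pipeline1 = Pipeline([('vec', shared)]).fit(data1)
# Snapshot of the vocabulary learned from data1.
vocab_after_first_fit = sorted(shared.vocabulary_)

pipeline2 = Pipeline([('vec', shared)]).fit(data2)
# The second fit replaces the vocabulary inside the same object.
vocab_after_second_fit = sorted(shared.vocabulary_)

# Both pipelines point at the same Python object, so they share fitted state.
same_object = pipeline1.named_steps['vec'] is pipeline2.named_steps['vec']

print(same_object)
print(vocab_after_first_fit)
print(vocab_after_second_fit)
```

Note that the default tokenizer of CountVectorizer ignores single-character tokens, which is why 'a' does not appear in the first vocabulary.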
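You just need to put the vectorizer's definition inside each pipeline, so that every pipeline gets its own instance, like this: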
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data1)
print(pipeline1.transform(data1))
# Works fine
pipeline2 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data2)
print(pipeline2.transform(data2))
# Works fine
print(pipeline1.transform(data1))
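If you would rather describe the steps once and reuse that description, one possible alternative (not part of the original answer) is sklearn.base.clone, which returns a new unfitted estimator with the same parameters. The sketch below assumes the same toy data; template, shape1, shape2 and independent are hypothetical names chosen for illustration.

```python
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Unfitted template that describes the steps once.
template = Pipeline([
    ('vec', CountVectorizer()),
    ('svd', TruncatedSVD(n_components=3)),
])

# clone() copies the parameters into fresh, unfitted estimators,
# so each pipeline owns its own vectorizer and SVD step.
pipeline1 = clone(template).fit(data1)
pipeline2 = clone(template).fit(data2)

# The two vectorizer steps are now distinct objects.
independent = pipeline1.named_steps['vec'] is not pipeline2.named_steps['vec']

# Transforming with pipeline1 after fitting pipeline2 no longer fails.
shape1 = pipeline1.transform(data1).shape
shape2 = pipeline2.transform(data2).shape
print(independent, shape1, shape2)
```

Instantiating CountVectorizer() inline, as in the answer above, achieves the same isolation; cloning is mainly convenient when the step list is long or built elsewhere.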