Locking steps in a scikit-learn pipeline (to prevent refitting)

ryt*_*ido 7 scikit-learn

Is there a convenient mechanism for locking steps in a scikit-learn pipeline so that they are not refit on pipeline.fit()? For example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC())
])

# fit initial step on subset of data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)

# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

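Without the intermediate transformation, the only approach I can think of is to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fitted transformer. Is that the only way?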
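For reference, the custom-class idea described above might look roughly like the sketch below. The class name FrozenTransformer and the tiny corpus are made up for illustration; this is not an existing scikit-learn class, just one way the wrapper could be written.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that delegates transform() to a pre-fitted transformer."""

    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None, **fit_params):
        # Intentionally a no-op, so Pipeline.fit() leaves the wrapped step untouched.
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)


# Tiny made-up corpus, only for illustration.
docs = ["red apple", "green apple", "fast car", "slow car"]
labels = [0, 0, 1, 1]

vectorizer = CountVectorizer().fit(docs[:2])  # fit on a subset only
pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(vectorizer)),
    ("estimator", LinearSVC()),
])
pipeline.fit(docs, labels)

# The vocabulary still reflects only the two documents used for the initial fit.
vocabulary_after_fit = sorted(vectorizer.vocabulary_)
print(vocabulary_after_fit)
```

One caveat with this sketch: sklearn.base.clone rebuilds estimator-valued parameters unfitted, so utilities that clone the pipeline (cross-validation, grid search) would lose the pre-fitted state. It is only meant for a direct pipeline.fit() call.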

小智 4

Looking at the code, there doesn't seem to be anything in the Pipeline object that provides this: calling .fit() on the pipeline results in .fit() on every stage.

The best quick-and-dirty solution I can think of is to monkey-patch the stage's fitting methods:

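vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(X[:400])
# disable .fit() on the vectorizer step; functions assigned on an instance
# are not bound, so they must not take `self`
vectorizer.fit = lambda X, y=None, **fit_params: vectorizer
# Pipeline.fit() calls fit_transform(X, y) on intermediate steps, so drop y
vectorizer.fit_transform = lambda X, y=None, **fit_params: vectorizer.transform(X)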

pipeline.fit(X, y)
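To check that a patch like this actually holds, here is a small self-contained variant on a made-up corpus (so nothing needs to be downloaded). The replacement functions take no self, since a function assigned on an instance is not bound to it, and fit_transform discards y because CountVectorizer.transform() only accepts the documents. Treat it as a sketch rather than a recommended pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only for illustration.
docs = ["red apple", "green apple", "fast car", "slow car"]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC()),
])

vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(docs[:2])  # initial fit on a subset
size_before = len(vectorizer.vocabulary_)

# Instance-level patches: plain functions, so they take no `self`.
vectorizer.fit = lambda X, y=None, **fit_params: vectorizer
vectorizer.fit_transform = lambda X, y=None, **fit_params: vectorizer.transform(X)

pipeline.fit(docs, labels)  # fits only the estimator now
size_after = len(vectorizer.vocabulary_)
print(size_before, size_after)
```

The patches live on this one instance only, so anything that clones the pipeline (for example cross-validation helpers) gets a fresh, unpatched vectorizer.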