Getting the intermediate data state in a scikit-learn Pipeline

thi*_*tbl 10 python scikit-learn

Given the following example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)

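I would like to get the intermediate data state in the scikit-learn pipeline that corresponds to the tf_idf output (after fit_transform on tf_idf but before NMF), i.e. the NMF input. Put another way, it should be the same as applying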

TfidfVectorizer().fit_transform(data.test)

I know that pipe.named_steps["tf_idf"] returns the intermediate transformer, but with it I can only get the transformer's parameters, not the data.

Mar*_* V. 7

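As @Vivek Kumar suggested in the comments, and as I answered here, I find a debug step useful for printing information or writing the intermediate DataFrame to CSV: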

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator


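class Debug(BaseEstimator, TransformerMixin):
    """Pass-through step that reports the shape of the data flowing through."""

    def transform(self, X):
        print(X.shape)
        # Store the shape so it can be read back from the fitted pipeline.
        self.shape = X.shape
        # what other output you want
        return X

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step only observes the data.
        return self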

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("debug", Debug()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)

Edit

I have now added state to the debug transformer. You can access the shape, in the same way as in @datasailor's answer, with:

pipe.named_steps["debug"].shape
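
If the data itself is needed rather than only its shape, the same pass-through idea can keep a reference to whatever it receives. The following is a minimal sketch, not part of the original answer: the `Capture` class and its `last_X_` attribute are hypothetical names, and the `NMF` parameters are chosen only to keep the example small.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class Capture(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that remembers the last data it saw."""

    def fit(self, X, y=None):
        # Nothing to learn.
        return self

    def transform(self, X):
        # Keep a reference to the intermediate data, then pass it on unchanged.
        self.last_X_ = X
        return X


docs = ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("capture", Capture()),
    # n_components and random_state are assumptions made for this example.
    ("nmf", NMF(n_components=2, random_state=0)),
])
pipe.fit_transform(docs)

# Sparse TF-IDF matrix exactly as it was handed to the NMF step.
tfidf_output = pipe.named_steps["capture"].last_X_
print(tfidf_output.shape)
```

Note that this stores a reference, not a copy, so it holds the whole intermediate matrix in memory for as long as the pipeline lives.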


dat*_*lor 6

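As I understand it, you want to get the transformed training data. The transformer in pipe.named_steps["tf_idf"] has already been fitted, so just use this fitted model to transform the training data again: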

pipe.named_steps["tf_idf"].transform(data.test)
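
As a quick sanity check, the result can be compared with the standalone `TfidfVectorizer().fit_transform(...)` call from the question. This is only a sketch under the assumption that tf_idf is the first step of the pipeline, as in the example; the `NMF` parameters and the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    # n_components and random_state are assumptions made for this example.
    ("nmf", NMF(n_components=2, random_state=0)),
])
pipe.fit_transform(docs)

# Re-apply the already fitted first step to the raw training data.
from_pipeline = pipe.named_steps["tf_idf"].transform(docs)

# The reference result the question asks for.
standalone = TfidfVectorizer().fit_transform(docs)

# Both are sparse matrices; densify the tiny example to compare values.
same = np.allclose(from_pipeline.toarray(), standalone.toarray())
print(same)
```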

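  • Thanks a lot. I did use this, but in my case the pipeline is large, so many transformers are applied before my tf_idf (the given example is just a reproducible example). (2 upvotes)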
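
For the situation described in this comment, where several transformers run before the step of interest, one option is to slice the fitted pipeline instead of calling a single step. The sketch below assumes scikit-learn 0.21 or later, where `Pipeline` supports slice indexing; the `NMF` parameters and variable names are chosen only for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    # n_components and random_state are assumptions made for this example.
    ("nmf", NMF(n_components=2, random_state=0)),
])
pipe.fit(docs)

# Every step except the last one; the sub-pipeline reuses the fitted steps.
up_to_nmf = pipe[:-1]

# Runs all preceding transformers in order and returns the input to NMF.
nmf_input = up_to_nmf.transform(docs)
print(nmf_input.shape)
```

With a longer pipeline the slice bound can be adjusted (for example `pipe[:3]`) to stop after any intermediate step, and every transformer before that point is applied in order.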