Sklearn Pipeline: Get feature names after OneHotEncode in ColumnTransformer

Res*_*per 12 python scikit-learn

I want to get the feature names after I fit the pipeline.

categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Then

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', GradientBoostingRegressor())])

After fitting on a pandas DataFrame, I can get the feature importances from

clf.steps[1][1].feature_importances_

I tried clf.steps[0][1].get_feature_names() but got an error:

AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.

How can I get the feature names from this?

ZAK*_*ZKI 37

Scikit-Learn 1.0 now has a new feature for keeping track of feature names.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
# Only patch when the installed version lacks the method.
if not hasattr(SimpleImputer, "get_feature_names_out"):
    SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                           self.feature_names_in_)

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
    (num_pipeline, ["age", "height"]),
    (OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())

df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
                   "age": [32, 65, 18, 24],
                   "height": [172, 163, 169, 190],
                   "weight": [65, 62, 54, 95]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])

pipeline.fit(df, df["weight"])

## get pipeline feature names
pipeline[:-1].get_feature_names_out()

## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
             columns=pipeline[:-1].get_feature_names_out(),
             index=df.index)

  • @AndiAnderle get_feature_names_out is not implemented on all estimators, see https://github.com/scikit-learn/scikit-learn/issues/21308. I use pipeline[:-1] to select only the column transformer step. (2 upvotes)

Ven*_*lam 11

You can access the feature names using the following snippet!

clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

Reproducible example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand'      : ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category'   : ['asdf','asfa','asdfas','as'], 
                   'num1'       : [1, 1, 0, 0] ,
                   'target'     : [0.2,0.11,1.34,1.123]})



numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor',  LinearRegression())])
clf.fit(df.drop(columns='target'), df['target'])

clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
#  'category_asdf' 'category_asdfas' 'category_asfa']

Update:

With sklearn >= 0.21, we can make this simpler:

clf['preprocessor'].transformers_[1][1]['onehot']\
                         .get_feature_names(categorical_features)

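  • Exactly, but how do you make sure they are combined in the right order so that they match the feature importance vector? It does not seem straightforward; I would appreciate an elegant code snippet. (11 upvotes)
  • The order of combination will be the same as the pipeline steps, so we can find the exact order of the features. /sf/answers/4027388291/ This answer may be useful to you. (4 upvotes)
  • How do I correctly match the feature importances to all feature names (numeric + categorical)? Especially with OHE(handle_unknown='ignore'). (2 upvotes)
  • So `StandardScaler()` does not have `get_feature_names()`. Do we have to combine the numeric field names and the one-hot encoded field names afterwards? Is there any other API that gives us the complete feature names? (2 upvotes)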

lou*_*isD 6

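Edit: actually, the answer to Peter's comment is in the ColumnTransformer documentation:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.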


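To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names as they appear in the ColumnTransformer .get_feature_names() method depends on the order in which the steps variable is declared when the ColumnTransformer is instantiated.

I could not find any documentation, so I just played with the toy example below, and that let me understand the logic.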

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

class testEstimator(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self, X, y=None):
        return self

    def transform(self,X):
        return np.full(X.shape, self.string).reshape(-1,1)

    def get_feature_names(self):
        return self.string

transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)

dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)

for name,step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())

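To have a more representative example, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. By the way, you will find my version of Venkatachalam's approach, which gets the feature names by looping over the steps. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension: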

[i for k, v in pipeline.named_steps.items() if hasattr(v, 'get_feature_names') for i in v.get_feature_names()]

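So play around with dt_test and the estimators to see how the feature names are built and how they are concatenated in get_feature_names(). Here is another example with a transformer that outputs 2 columns from the input column: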

class testEstimator3(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self, X, y=None):
        self.unique = np.unique(X)[0]
        return self

    def transform(self,X):
        return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)

    def get_feature_names(self):
        return list((self.unique,self.string))

dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)

transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)

pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())


Muh*_*nur 5

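If you are looking for how to access the column names after successive pipelines, the last of which is a ColumnTransformer, you can access them by following the example below:

The full_pipeline contains two pipelines, gender and relevent_experience: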

full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
    ("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])

The gender pipeline looks like this:

gender_encoder = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ("cat", OneHotEncoder())
])

After fitting full_pipeline, you can access the column names with the following snippet:

full_pipeline.transformers_[0][1][1].get_feature_names_out() 

In my case, the output was: array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)

  • This does not work for me, because I get AttributeError: 'ColumnTransformer' object has no attribute 'transformers_' (2 upvotes)