How do I apply each step of a sklearn Pipeline to selected columns only?

Nav*_*ala 1 python pipeline python-3.x scikit-learn

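While looking into how to make a step in a sklearn Pipeline operate on only some columns, I came across sklearn.pipeline.FeatureUnion in this answer on Stack Overflow. However, I can't figure out how to leave the columns I don't want to touch unchanged and still pass the complete data on to the next step. For example, in my first step I want to apply StandardScaler to only some columns. That can be done with the code shown below, but the problem is that the next step then receives only the standard-scaled columns. How can I get the complete data in the next step, with the columns from the previous step standard-scaled?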

Here is some sample code:

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor  # used by the final 'model' step

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]


# placeholder: fill in the list of numeric column names
numeric_cols = [...]

pipe = Pipeline([
    # the step below applies to only some columns
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric_cols), StandardScaler())),
    ])),
    # FeatEng_1/2/3 and Skew_Remover are custom transformers that are not shown here
    ('feature_engineer_step1', FeatEng_1()),
    ('feature_engineer_step2', FeatEng_2()),
    ('feature_engineer_step3', FeatEng_3()),
    ('remove_skew', Skew_Remover()),

    # the step below applies to all columns
    ('model', RandomForestRegressor())
])
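To make the problem concrete, here is a minimal sketch on a made-up three-column DataFrame (the column names and values are invented for illustration). A FeatureUnion with a single branch returns only that branch's columns. One possible workaround, short of switching tools, is a second branch that selects the remaining columns and applies no transformer:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler


class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""

    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.names]


# Toy data, invented for this example.
df = pd.DataFrame({
    "age": [10.0, 20.0, 30.0],
    "rooms": [1.0, 2.0, 3.0],
    "latitude": [34.0, 35.0, 36.0],
})

# One branch only: the output contains just the two scaled columns.
scaled_only = FeatureUnion([
    ("numeric", make_pipeline(Columns(names=["age", "rooms"]), StandardScaler())),
])
narrow = scaled_only.fit_transform(df)

# Extra branch: select the untouched column and stack it next to the scaled ones.
scaled_plus_rest = FeatureUnion([
    ("numeric", make_pipeline(Columns(names=["age", "rooms"]), StandardScaler())),
    ("rest", Columns(names=["latitude"])),
])
both = scaled_plus_rest.fit_transform(df)

print(narrow.shape, both.shape)
```

The second union returns the scaled columns followed by the untouched one, so downstream steps see all three columns.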

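Edit:

Since the accepted answer did not include any example code, I'm pasting my code here for anyone who runs into this problem and is looking for working code. The data used in the example below is the California housing data that ships with Google Colab.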

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# load the data (assumption: the sample CSVs are at their usual Colab location)
house_train = pd.read_csv('sample_data/california_housing_train.csv')
house_test = pd.read_csv('sample_data/california_housing_test.csv')

# a column transformer that operates on only some columns
num_cols = ['housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income']
p_stand_scaler_1 = ColumnTransformer(
    transformers=[('stand_scale', StandardScaler(), num_cols)],
    # remainder='passthrough' passes all unspecified columns on to the next step untouched
    remainder='passthrough')

# build a pipeline with all the steps
pipe_1 = Pipeline(steps=[('standard_scaler', p_stand_scaler_1),
                         ('rf_regressor', RandomForestRegressor(random_state=100))])

# fit on the training data
pipe_1.fit(house_train.drop('median_house_value', axis=1), house_train['median_house_value'])

# make predictions
pipe_predictions = pipe_1.predict(house_test.drop('median_house_value', axis=1))

小智 6

You can use ColumnTransformer from sklearn. Here is a snippet that may help.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

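# transform columns
# num_cols = list of numerical column names, categorical_col = list of categorical column names
# (both must be defined beforehand; columns in neither list are dropped by default)
preprocessor = ColumnTransformer(transformers=[('minmax', MinMaxScaler(), num_cols),
                                               ('onehot', OneHotEncoder(), categorical_col)])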

#model
model = RandomForestClassifier(random_state=0)

#model pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

model_pipeline.fit(x_train, y_train)


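  • I found that, by default, the column transformer drops the columns that are not specified and passes only the specified columns on to the next step. The `remainder='passthrough'` parameter must be set explicitly so that the unspecified columns are kept in the column transformer's output instead of being dropped. (2 upvotes)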
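The difference described in the comment can be seen in a short sketch that reuses the answer's preprocessor on a made-up DataFrame (the data and the `build` helper are invented for illustration). ColumnTransformer's default is remainder='drop', so the unlisted `latitude` column only survives when remainder='passthrough' is set:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: "latitude" appears in neither column list below.
df = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0],
    "city": ["a", "b", "a", "b"],
    "latitude": [34.0, 35.0, 36.0, 37.0],
})
num_cols = ["income"]
categorical_col = ["city"]


def build(remainder):
    """Hypothetical helper: the answer's preprocessor with a chosen remainder."""
    return ColumnTransformer(
        transformers=[("minmax", MinMaxScaler(), num_cols),
                      ("onehot", OneHotEncoder(), categorical_col)],
        remainder=remainder,
    )


dropped = build("drop").fit_transform(df)        # default behaviour
kept = build("passthrough").fit_transform(df)    # unlisted columns kept

# One scaled column plus two one-hot columns, with or without "latitude".
print(dropped.shape, kept.shape)
```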