sklearn stacking estimator passthrough skips preprocessing and passes the raw data

Lim*_*huo 5 python scikit-learn sklearn-pandas

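This issue has already been raised here, but it has no comments yet: https://github.com/scikit-learn/scikit-learn/issues/16473

I have some numeric features and some categorical features in X. The categorical features are one-hot encoded, so my pipeline is similar to the sklearn documentation example: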

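from sklearn.compose import make_column_transformer
# On scikit-learn < 1.0, HistGradientBoostingRegressor also requires:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (HistGradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# cat_cols, num_cols, categories and processor_nlin (the preprocessor used
# for the tree-based models) are defined elsewhere and not shown here.

# Categorical columns: fill missing values, then one-hot encode.
cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)

# Numeric columns: impute with the mean, then standardize.
num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')

# Each base estimator carries its own preprocessing step.
lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())

rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))

gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))

estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())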


But if I set passthrough=True, it raises a ValueError, because passthrough hands the raw X to the final estimator and skips the preprocessing part of the pipelines:

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: could not convert string to float: 'RL'
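To illustrate the behaviour described above, here is a minimal sketch on purely numeric toy data (the data, estimator names and `cv=3` are assumptions made for the example, not part of my real setup). It inspects what the final estimator would receive with `passthrough=True`: the base estimators' predictions followed by the untouched input columns, even though every base pipeline starts with a `StandardScaler`.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical numeric data, deliberately far from zero mean / unit variance.
rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Each base estimator scales its input inside its own pipeline.
estimators = [
    ("lin", make_pipeline(StandardScaler(), LinearRegression())),
    ("tree", make_pipeline(StandardScaler(),
                           DecisionTreeRegressor(random_state=0))),
]

stack = StackingRegressor(estimators=estimators,
                          final_estimator=Ridge(),
                          passthrough=True,
                          cv=3)
stack.fit(X, y)

# transform() returns the feature matrix that is fed to the final estimator:
# one prediction column per base estimator, then the passed-through X.
meta = stack.transform(X)
n_base = len(estimators)
passed_through = meta[:, n_base:]

print(meta.shape)
# The passed-through block is the raw X, not the scaled version.
print(np.allclose(passed_through, X))
```

With string-valued categorical columns in X, that raw block is what the final estimator fails to convert to float.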


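Is there a way to make passthrough include the first, preprocessing part of the pipelines?

I also cannot add a preprocessing pipeline in front of the final estimator, because passthrough concatenates the raw X dataframe with the base-layer predictions, which are a numpy array, as described in the GitHub issue linked at the top of this post. My actual preprocessing pipeline has several custom transformers that operate on pandas dataframes.

Thanks for any help!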
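One possible workaround, sketched below under a strong assumption: a single shared preprocessor is acceptable for all base estimators (unlike the separate processor_lin / processor_nlin above). The preprocessing is moved out of the base pipelines and placed in front of the whole StackingRegressor, so that what passthrough forwards is the already encoded and scaled matrix. The toy dataframe, its column names and the reduced set of estimators are hypothetical and only serve to make the sketch runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'RL' echoes the value from the error message.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "zone": rng.choice(["RL", "RM"], size=60),
    "area": rng.uniform(50, 200, size=60),
})
y = X["area"] * 2.0 + (X["zone"] == "RL") * 10.0 + rng.normal(size=60)

# Assumption: one preprocessor shared by every base estimator.
processor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["zone"]),
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
     ["area"]),
    sparse_threshold=0,  # always return a dense array
)

# Base estimators are bare models; they see the preprocessed matrix.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
        ("lasso", LassoCV()),
    ],
    final_estimator=RidgeCV(),
    passthrough=True,
)

# Preprocess once, then stack: passthrough now forwards numeric features.
model = make_pipeline(processor, stack)
model.fit(X, y)
pred = model.predict(X.head())
print(pred.shape)
```

The trade-off is that the stack only ever sees a numpy array, so any step inside it can no longer select columns by name or rely on a pandas dataframe.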

Archived:	5 years ago
Viewed:	405 times
Last activity:	5 years ago