sklearn stacking estimator: passthrough skips preprocessing and passes the raw data

Lim*_*huo 5 python scikit-learn sklearn-pandas

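This has already been raised here, but it has not received any comments yet: https://github.com/scikit-learn/scikit-learn/issues/16473

X contains both numeric and categorical features, and the categorical features are one-hot encoded. My pipeline therefore looks much like the example in the sklearn documentation: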

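from sklearn.compose import make_column_transformer
from sklearn.ensemble import (HistGradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# On scikit-learn < 1.0, HistGradientBoostingRegressor also requires:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa

# categories, cat_cols, num_cols and processor_nlin are defined elsewhere
# (not shown in the question).

# Categorical columns: fill missing values, then one-hot encode.
cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)

# Numeric columns: impute with the mean, then standardize.
num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')

# Each base estimator carries its own preprocessing step.
lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())

rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))

gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))

estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())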

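But if I set passthrough=True on the StackingRegressor, fitting raises a ValueError, because passthrough hands the raw X to the final estimator and skips the preprocessing step of the pipelines: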

/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: could not convert string to float: 'RL'
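
The failure can be reproduced on a small scale. The sketch below is only an illustration under assumptions: the toy columns `zone` and `area`, the helper names `make_base_pipeline` and `fit_stack`, and the choice of base models are all hypothetical and not taken from the question. It fits the same kind of stack twice, once with `passthrough=False` and once with `passthrough=True`, and returns the error message if fitting fails.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column and one numeric column.
X = pd.DataFrame({
    "zone": ["RL", "RM"] * 10,
    "area": np.arange(20, dtype=float),
})
y = 3.0 * X["area"].to_numpy() + 5.0 * (X["zone"] == "RL").to_numpy()


def make_base_pipeline(model):
    """Wrap a model so that it one-hot encodes 'zone' before fitting."""
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["zone"]),
        remainder="passthrough",
    )
    return make_pipeline(preprocessor, model)


def fit_stack(passthrough):
    """Fit the stack; return None on success or the ValueError message."""
    stack = StackingRegressor(
        estimators=[
            ("linear", make_base_pipeline(LinearRegression())),
            ("forest", make_base_pipeline(
                RandomForestRegressor(n_estimators=10, random_state=0))),
        ],
        final_estimator=RidgeCV(),
        passthrough=passthrough,
    )
    try:
        stack.fit(X, y)
    except ValueError as exc:
        return str(exc)
    return None


print(fit_stack(passthrough=False))
print(fit_stack(passthrough=True))
```

With `passthrough=False` the final estimator only sees the numeric predictions of the base pipelines. With `passthrough=True` the unprocessed `X`, including the string column, is appended to those predictions, so the final estimator is the one that fails on the string values.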

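Is there a way to make passthrough include the first (preprocessing) part of the pipeline?

I also cannot put a preprocessing pipeline in front of the final estimator, because the stacker concatenates the raw X DataFrame with the predictions fed to the final layer, which are a NumPy array, as described in the GitHub discussion linked at the top of this post. My actual preprocessing pipeline contains several custom transformers that operate on pandas DataFrames.

Thanks for any help!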
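
One arrangement that sidesteps the concatenation problem is to preprocess once, in front of the whole stack rather than in front of the final estimator, so that the X passed through is already numeric. The sketch below is an assumption-laden illustration, not a confirmed scikit-learn recommendation: it only applies when every base estimator can share the same preprocessing, which is not the case for the separate `processor_lin` and `processor_nlin` in the question. The toy columns `zone` and `area` and the name `shared_preprocessor` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one string column and one numeric column.
X = pd.DataFrame({
    "zone": ["RL", "RM"] * 10,
    "area": np.arange(20, dtype=float),
})
y = 3.0 * X["area"].to_numpy() + 5.0 * (X["zone"] == "RL").to_numpy()

# Shared preprocessing applied once, before the stack sees the data.
# sparse_threshold=0 forces a dense array so every estimator can consume it.
shared_preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["zone"]),
    (StandardScaler(), ["area"]),
    sparse_threshold=0,
)

# Base estimators are bare models here; they receive preprocessed data.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    passthrough=True,
)

model = make_pipeline(shared_preprocessor, stack)
model.fit(X, y)
predictions = model.predict(X)

# Number of inputs to the final estimator:
# 2 base predictions + 2 one-hot columns + 1 scaled numeric column.
meta_width = stack.final_estimator_.coef_.shape[0]
print(predictions.shape, meta_width)
```

Because the outer pipeline still receives the original DataFrame, custom transformers that operate on pandas DataFrames could in principle sit in the shared preprocessing step. The trade-off is that per-estimator preprocessing is lost.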