Lim*_*huo 5 python scikit-learn sklearn-pandas
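This issue has already been raised here, but it has no comments yet: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numeric and some categorical features in X. The categorical features are one-hot encoded, so my pipeline is similar to the example in the sklearn documentation: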
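from sklearn.compose import make_column_transformer
from sklearn.ensemble import (HistGradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# cat_cols, num_cols and categories are defined elsewhere (not shown in this post).

# Categorical columns: fill missing values with a constant, then one-hot encode.
cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)

# Numeric columns: impute the mean, then standardize.
num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')

lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())

# processor_nlin (the preprocessor used for the tree-based models) is not shown in this post.
rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))

gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))

estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())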
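But if I set passthrough=True, fitting fails (the traceback below shows a ValueError), because passthrough hands the original X to the final estimator and skips the preprocessing part of the pipelines: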
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
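The failure can be reproduced on toy data with a much smaller stack. The sketch below is only an illustration: the column names `zone` and `area`, the data, and the single linear base estimator are made up and are not part of my real pipeline. The base pipeline encodes the string column without trouble, but with `passthrough=True` the raw DataFrame is stacked next to the predictions, so the final estimator is handed the unencoded strings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one string-valued column and one numeric column.
X = pd.DataFrame({
    "zone": ["RL", "RM"] * 5,
    "area": np.arange(10, dtype=float),
})
y = np.arange(10, dtype=float)

# The base estimator carries its own preprocessing, as in the post above.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["zone"]),
    (StandardScaler(), ["area"]),
)
base_pipeline = make_pipeline(preprocessor, LinearRegression())

stack = StackingRegressor(
    estimators=[("lin", base_pipeline)],
    final_estimator=Ridge(),
    passthrough=True,  # raw X is appended to the base predictions
    cv=2,
)

try:
    stack.fit(X, y)
    error_message = None
except ValueError as exc:
    # The final estimator receives the raw string column and cannot cast it.
    error_message = str(exc)

print(error_message)
```

With `passthrough=False` the same stack fits, since the final estimator then only sees the numeric predictions of the base estimators.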
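Is there a way to make passthrough include the first, preprocessing part of the pipeline?
I also cannot add a preprocessing pipeline in front of the final estimator, because the stacker concatenates the original X DataFrame with the base estimators' predictions, which are a numpy array, as described in the GitHub discussion linked at the top of this post. My actual preprocessing pipeline has several custom transformers that operate on pandas DataFrames.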
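One arrangement that might behave the way I want, at least when a single shared preprocessor is acceptable for all base models, is to move the preprocessing in front of the whole StackingRegressor instead of inside each base pipeline. The sketch below is a hedged illustration on toy data, not my real setup: the column names and data are hypothetical, and it assumes the linear and tree-based models can share one preprocessor, which is not the case in the code above, where processor_lin and processor_nlin differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one string-valued column and one numeric column.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "zone": ["RL", "RM"] * 10,
    "area": np.arange(20, dtype=float),
})
y = 2.0 * X["area"].to_numpy() + rng.normal(size=20)

# Shared preprocessing, applied once to the DataFrame before stacking.
# sparse_threshold=0 forces a dense array out of the column transformer.
shared_preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["zone"]),
    (StandardScaler(), ["area"]),
    sparse_threshold=0,
)

# Base estimators are bare models here; they receive preprocessed features.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
        ("lasso", LassoCV(cv=3)),
    ],
    final_estimator=RidgeCV(),
    passthrough=True,  # now appends the *preprocessed* features
    cv=2,
)

model = make_pipeline(shared_preprocessor, stack)
model.fit(X, y)
preds = model.predict(X)

# Number of inputs seen by the final estimator:
# base predictions plus the preprocessed feature columns.
n_meta_features = model.named_steps["stackingregressor"].final_estimator_.coef_.shape[0]
print(preds.shape, n_meta_features)
```

Because the DataFrame only reaches the shared preprocessor, custom transformers that operate on pandas DataFrames could sit in that first step. The cost is that every base model sees the same features, so this does not cover the case where each model needs its own preprocessing.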
Thanks for any help!