TargetEncoder (from category_encoders) in a scikit-learn pipeline causes an IndexingError in GridSearchCV

Duk*_*ong 5 python-3.x pandas scikit-learn

I am using target encoding for some of the features in my dataset. My full pipeline looks like this:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression  # used below; missing from the original snippet
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from category_encoders.target_encoder import TargetEncoder

numeric_features = ['feature_1']
numeric_pipeline = Pipeline(steps=[('scaler', StandardScaler())])

ohe_features = ['feature_2', 'feature_3', 'feature_4']
ohe_pipeline = Pipeline(steps=[('ohe', OneHotEncoder())])

te_features = ['feature_5', 'feature_6']
te_pipeline = TargetEncoder()

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_pipeline, numeric_features),
    ('ohe_features', ohe_pipeline, ohe_features),
    ('te_features', te_pipeline, te_features),
])

clf_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    df_testing.drop(columns='target'),
    df_testing['target'],
    stratify=df_testing['target'])

params = {'classifier__C': [0.001, 0.01, 0.05, 0.1, 1]}

gs = GridSearchCV(clf_lr, params, cv=3)
gs.fit(X_train, y_train)

The problem is that the call to GridSearchCV's fit method fails because of the TargetEncoder step in the pipeline. Specifically, it raises:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

I get this error even if I call reset_index(drop=True) on both X_train and y_train.
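For illustration only, here is a minimal pandas-only sketch of how this error message can arise even after the indexes have been reset. It does not use TargetEncoder or GridSearchCV. The toy data, the `train_positions` list, and the idea that a boolean mask gets built from raw values somewhere downstream are all assumptions made for the example, not a confirmed description of what category_encoders does internally.

```python
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical data), already reset to 0..5.
X = pd.DataFrame({"feature_5": list("abcabc")}).reset_index(drop=True)
y = pd.Series([0, 1, 0, 1, 1, 0], name="target").reset_index(drop=True)

# Positional selection, similar in spirit to how a CV split picks training rows.
# The selected rows keep their original labels, so the index now has gaps.
train_positions = [0, 2, 3, 5]  # assumed fold, for illustration
X_fold = X.iloc[train_positions]
y_fold = y.iloc[train_positions]
fold_labels = list(X_fold.index)

# A boolean mask built from raw values gets a fresh 0..n-1 index,
# which no longer matches the fold's labels.
mask = pd.Series(y_fold.to_numpy() == 1)

try:
    X_fold.loc[mask]
    error_name, error_message = None, ""
except Exception as exc:  # caught broadly so the sketch is pandas-version agnostic
    error_name, error_message = type(exc).__name__, str(exc)

# The same mask re-attached to the fold's own index lines up and works.
aligned_mask = pd.Series(mask.to_numpy(), index=X_fold.index)
selected = X_fold.loc[aligned_mask]

print(fold_labels)
print(error_name, error_message)
print(selected)
```

If something like this is what happens inside a fold, it would be consistent with resetting the index of the full X_train and y_train not helping: each fold is a positional subset whose labels are no longer contiguous.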

If I simply call:

clf_lr.fit(X_train.reset_index(drop=True), y_train.reset_index(drop=True))
clf_lr.score(X_test.reset_index(drop=True), y_test.reset_index(drop=True)) # both calls to reset_index required otherwise the same IndexingError is thrown

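That code works. However, I need cross-validation to find the best value of the C parameter for LogisticRegression. The same applies to cross-validating any other model I want to try.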

Can someone tell me whether this is a known issue with TargetEncoder, or whether I have implemented or set up the pipeline incorrectly?