Jon*_*tel 5 python scikit-learn xgboost
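I'm running into a strange problem with a fresh install of xgboost. Under normal circumstances it works fine. However, when I use the model inside the following function, it raises the error in the title.
The dataset I'm using is borrowed from Kaggle and can be found here: https://www.kaggle.com/kemical/kickstarter-projects
The function I use to fit the model is as follows: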
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score

def get_val_scores(model, X, y, return_test_score=False, return_importances=False, random_state=42, randomize=True, cv=5, test_size=0.2, val_size=0.2, use_kfold=False, return_folds=False, stratify=True):
    print("Splitting data into training and test sets")
    if randomize:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=True, random_state=random_state)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    else:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=False)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
    print(f"Shape of training data, X: {X_train.shape}, y: {y_train.shape}.  Test, X: {X_test.shape}, y: {y_test.shape}")
    if use_kfold:
        val_scores = cross_val_score(model, X=X_train, y=y_train, cv=cv)
    else:
        print("Further splitting training data into validation sets")
        if randomize:
            if stratify:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=True)
        else:
            if stratify:
                print("Warning! You opted to both stratify your training data and to not randomize it.  These settings are incompatible with scikit-learn.  Stratifying the data, but shuffle is being set to True")
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train,  shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=False)
        print(f"Shape of training data, X: {X_train_.shape}, y: {y_train_.shape}.  Val, X: {X_val.shape}, y: {y_val.shape}")
        print("Getting ready to fit model.")
        model.fit(X_train_, y_train_)
        val_score = model.score(X_val, y_val)
        
    if return_importances:
        if hasattr(model, 'steps'):
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
        else:
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            
    mod_scores = {}
    try:
        mod_scores['validation_score'] = val_scores.mean()
        if return_folds:
            mod_scores['fold_scores'] = val_scores
    except:
        mod_scores['validation_score'] = val_score
        
    if return_test_score:
        mod_scores['test_score'] =  model.score(X_test, y_test)
            
    if return_importances:
        return mod_scores, feats
    else:
        return mod_scores
The strange part is that if I create a pipeline in sklearn, it works on the dataset outside the function but not inside it. For example:
from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from xgboost import XGBClassifier
pipe = make_pipeline(OrdinalEncoder(), XGBClassifier())
X = df.drop('state', axis=1)
y = df['state']
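In this case, pipe.fit(X, y) works fine, but get_val_scores(pipe, X, y) fails with the error message in the title. Even stranger, get_val_scores(pipe, X, y) does seem to work on other datasets, such as Titanic. The error occurs when the model is fit on X_train and y_train.
Here the loss function is binary:logistic, and the state column has the values successful and failed.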
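Since fitting on the full X and y works while fitting on the split does not, one thing that may be worth checking is which labels actually end up in each subset. The following is only a diagnostic sketch, not a confirmed cause of the error: label_report is a hypothetical helper, and the toy lists stand in for df['state'] and the y_train produced inside get_val_scores.

```python
from collections import Counter


def label_report(y_full, y_subset):
    """Compare the labels in a subset (e.g. y_train) against the full target."""
    full_counts = Counter(y_full)
    subset_counts = Counter(y_subset)
    return {
        "full": dict(full_counts),
        "subset": dict(subset_counts),
        # Labels present in the full target but absent from the subset.
        "missing_from_subset": sorted(set(full_counts) - set(subset_counts)),
    }


# Toy data standing in for df['state']; the real column may contain other values.
y = ["successful", "failed", "failed", "successful", "failed"]
y_train = ["failed", "failed", "failed"]

report = label_report(y, y_train)
print(report["missing_from_subset"])  # ['successful']
```

Because Counter accepts any iterable, the same helper should also accept a pandas Series such as y and y_train directly.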
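The xgboost library is currently being updated to fix this bug, so for now the workaround is to downgrade to an older release. In my case, downgrading to xgboost v0.90 resolved the problem.
Try checking your xgboost version from cmd: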
python 
import xgboost
print(xgboost.__version__)
exit()
If the version is not 0.90, uninstall the current version with:
pip uninstall xgboost
Then install xgboost version 0.90:
pip install xgboost==0.90
Run your code again!
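The version check above can also be done from a script rather than an interactive session. This is a minimal sketch assuming Python 3.8 or later (for importlib.metadata); installed_version and matches_target are hypothetical helper names, and the 0.90 target is simply the version reported to work in this answer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_target(current, target="0.90"):
    """True only when a version is installed and equals the target string."""
    return current == target


current = installed_version("xgboost")
if current is None:
    print("xgboost is not installed")
elif matches_target(current):
    print(f"xgboost {current} is already the target version")
else:
    print(f"xgboost {current} found; the answer suggests: pip install xgboost==0.90")
```

Reading the version from package metadata avoids importing xgboost itself, which can be convenient if the import is what fails.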