Cannot train xgboost on categorical columns

aru*_*836 12 python categorical-data xgboost

I am trying to run a Python notebook (link). At line [446] below, where the author trains XGBoost, I get this error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is a minimal example for testing:

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Link to the train_store data file: Link 1

Zhi*_*uan 8

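I ran into exactly the same problem while working on the Rossmann Sales Prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. As Mykhailo Lisovyi suggested, you can check the data types with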

print(test_train.dtypes)

Replace test_train here with your own X_train.

You will probably get something like

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64
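Instead of reading the dtype listing by eye, the offending columns can be picked out programmatically. The following is only a sketch on a made-up toy frame (the values are invented and stand in for `X_train[predictors]`); it assumes that anything which is neither numeric nor boolean is what triggers the ValueError.

```python
import pandas as pd

# Toy frame standing in for X_train[predictors]; the values are made up.
df = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["a", "b", "a"],
    "Assortment": ["a", "c", "c"],
    "CompetitionDistance": [1270.0, 570.0, 14130.0],
    "Open": [True, True, False],
})

# Keep only the columns that are neither numeric nor boolean.
bad_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(bad_cols)
```

Those are the columns that need to be encoded before the frame is passed to `xgb.DMatrix`.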

The error is raised for the columns of object type. You can convert them with

from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps, everything should run smoothly.


Ati*_*esh 7

Try this:

import pandas as pd

train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
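Note that `pd.to_numeric` only helps when the column holds numbers stored as strings. A small sketch with made-up values (not taken from the train_store file) shows how it behaves on digit strings versus letter codes:

```python
import pandas as pd

# Made-up example values: one column of digit strings, one with letter codes.
digits = pd.Series(["0", "1", "0"])
letters = pd.Series(["0", "a", "b"])

# Digit strings convert cleanly to an integer dtype.
numeric = pd.to_numeric(digits)
print(numeric.dtype)

# Letter codes cannot be parsed, so the default errors="raise" throws.
try:
    pd.to_numeric(letters)
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

# errors="coerce" turns unparseable entries into NaN instead of raising.
coerced = pd.to_numeric(letters, errors="coerce")
print(coerced.isna().sum())
```

If the conversion fails or produces NaN for your columns, they hold real category labels and need an encoder rather than a numeric cast.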

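  • If you want to use the trained model in production and will later need to apply the **same** encoding to test samples, you have to use a different encoding approach, such as the scikit-learn transformers shown in Zhi Yuan's answer, so that the transformation can be saved together with the model. Running pd.to_numeric() on new data may give a mapping different from the one you originally used during training! (2 upvotes)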
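One way to act on the comment above is to fit a single encoder on the training data, store it, and reuse it at prediction time. This is a rough sketch rather than the notebook author's method: it swaps in scikit-learn's `OrdinalEncoder` (instead of `LabelEncoder`) because it handles several columns at once, and the toy data and the choice of -1 for unseen categories are assumptions.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["StoreType", "Assortment"]

# Toy training data; the values are made up for illustration.
train = pd.DataFrame({"StoreType": ["a", "b", "c", "a"],
                      "Assortment": ["a", "c", "c", "b"]})

# Fit once on the training data; categories unseen at fit time map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = pd.DataFrame(encoder.fit_transform(train[cat_cols]),
                         columns=cat_cols, index=train.index)

# Serialize the fitted encoder; in practice these bytes would be written
# to disk next to the saved model.
blob = pickle.dumps(encoder)

# Later, at prediction time: restore the encoder and apply the same mapping.
restored = pickle.loads(blob)
new = pd.DataFrame({"StoreType": ["c", "d"], "Assortment": ["b", "a"]})
new_enc = pd.DataFrame(restored.transform(new[cat_cols]),
                       columns=cat_cols, index=new.index)
print(new_enc)
```

Because the mapping is learned once and only applied afterwards, a category keeps the same code in training and in production, and a value never seen in training ("d" here) is flagged as -1 instead of silently shifting the other codes.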