aru*_*836 12 python categorical-data xgboost
我正在尝试运行 Python 笔记本(链接)。在下面的行 [446]: where author train XGBoost,我收到一个错误
ValueError:数据的 DataFrame.dtypes 必须是 int、float 或 bool。没想到 StateHoliday、Assortment 字段中的数据类型
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Run Code Online (Sandbox Code Playgroud)
这是用于测试的最小代码
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
with open('train_store', 'rb') as f:
train_store = pickle.load(f)
train_store.shape
predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day',
'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth',
'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen',
'PromoOpen']
y = np.log(train_store.Sales) # log transformation of Sales
X = train_store
# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.3, # 30% for the evaluation set
random_state = 42)
# base parameters
params = {
'booster': 'gbtree',
'objective': 'reg:linear', # regression task
'subsample': 0.8, # 80% of data to grow trees and prevent overfitting
'colsample_bytree': 0.85, # 85% of features used
'eta': 0.1,
'max_depth': 10,
'seed': 42} # for reproducible results
num_round = 60 # default 300
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Run Code Online (Sandbox Code Playgroud)
链接到 train_store 数据文件:链接 1
我在做罗斯曼销售预测项目时遇到了完全相同的问题。似乎新版本的 xgboost 不接受StateHoliday、Assortment和StoreType的数据类型。您可以使用 Mykhailo Lisovyi 建议检查数据类型
print(test_train.dtypes)
Run Code Online (Sandbox Code Playgroud)
你需要用你的 X_train 替换这里的 test_train
你可能会得到
DayOfWeek int64
Promo int64
StateHoliday int64
SchoolHoliday int64
StoreType object
Assortment object
CompetitionDistance float64
CompetitionOpenSinceMonth float64
CompetitionOpenSinceYear float64
Promo2 int64
Promo2SinceWeek float64
Promo2SinceYear float64
Year int64
Month int64
Day int64
Run Code Online (Sandbox Code Playgroud)
错误上升到对象类型。您可以将它们转换为
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))
Run Code Online (Sandbox Code Playgroud)
经过这些步骤,一切都会顺利。
尝试这个
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
Run Code Online (Sandbox Code Playgroud)