XGBoost in Python: what is wrong with xgb.cv?

Fag*_*ain 0 python classification cross-validation xgboost

I am trying to use xgboost in Python. Here is my code. xgb.train works, but I get an error with xgb.cv, even though I seem to be calling it correctly.

The following works for me:

###### XGBOOST ######

import datetime
startTime = datetime.datetime.now()

import numpy as np  # needed for the np.array calls below
import xgboost as xgb
data_train   = np.array(traindata.drop('Category',axis=1))
labels_train = np.array(traindata['Category'].cat.codes)

data_valid   = np.array(validdata.drop('Category',axis=1))
labels_valid = np.array(validdata['Category'].astype('category').cat.codes)

weights_train = np.ones(len(labels_train))
weights_valid  = np.ones(len(labels_valid ))

dtrain = xgb.DMatrix( data_train, label=labels_train,weight = weights_train)
dvalid  = xgb.DMatrix( data_valid , label=labels_valid ,weight = weights_valid )




param = {'bst:max_depth':5, 'bst:eta':0.05, # eta [default=0.3]
         #'min_child_weight':1,'gamma':0,'subsample':1,'colsample_bytree':1,'scale_pos_weight':0, # default
         # max_delta_step:0 # default
         'min_child_weight':5,'scale_pos_weight':0, 'max_delta_step':2,
         'subsample':0.8,'colsample_bytree':0.8,
         'silent':1, 'objective':'multi:softprob' }


param['nthread'] = 4
param['eval_metric'] = 'mlogloss'
param['lambda'] = 2
param['num_class']=39

evallist  = [(dtrain,'train'),(dvalid,'eval')] # if there is a validation set
# evallist  = [(dtrain,'train')]                   # if there is no validation set

plst = param.items()
plst += [('ams@0','eval_metric')]

num_round = 100

bst = xgb.train( plst, dtrain, num_round, evallist,early_stopping_rounds=5 ) # early_stopping_rounds=10 # when there is a validation set

# bst.res=xgb.cv(plst,dtrain,num_round,nfold = 5,evallist,early_stopping_rounds=5)

bst.save_model('0001.model')

# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
# bst.dump_model('dump.raw.txt','featmap.txt')

x = datetime.datetime.now() - startTime
print(x)

But if I change the line...

bst = xgb.train( plst, dtrain, num_round, evallist,early_stopping_rounds=5 ) 

...to this...

bst.res = xgb.cv(plst,dtrain,num_round,nfold = 5,evallist,early_stopping_rounds=5)

...I get the following unexpected error:

  File "", line 45
    bst.res=xgb.cv(plst,dtrain,num_round,nfold = 5,evallist,early_stopping_rounds=5)
SyntaxError: non-keyword arg after keyword arg

EDIT1: I also tried changing the order of the arguments:

bst.res = xgb.cv(plst,dtrain,num_round,evallist,nfold = 5,early_stopping_rounds=5) 

...and I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-36177ef64bab> in <module>()
      43 # bst = xgb.train( plst, dtrain, num_round, evallist,early_stopping_rounds=5 ) # early_stopping_rounds=10 # when   there is a validation set
      44 
 ---> 45 bst.res=xgb.cv(plst,dtrain,num_round,evallist,nfold =5 ,early_stopping_rounds=5)
      46 
      47 bst.save_model('0001.model')

 TypeError: cv() got multiple values for keyword argument 'nfold'

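EDIT2: It turns out that no validation set is needed for CV. There is no evals parameter in the signature of xgb.cv (although xgb.train has one), so I removed it and changed the line to: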

bst.res=xgb.cv(params=plst,dtrain=dtrain,num_boost_round=num_round,nfold = 5,early_stopping_rounds=5)

Then I get this error:

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.pyc
in cv(params, dtrain, num_boost_round, nfold, metrics, obj, feval,
maximize, early_stopping_rounds, fpreproc, as_pandas, show_progress,
show_stdv, seed)
    413     best_score_i = 0
    414     results = []
--> 415     cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc)
    416     for i in range(num_boost_round):
    417         for fold in cvfolds:  
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.pyc
in mknfold(dall, nfold, param, seed, evals, fpreproc)
    280         else:
    281             tparam = param
--> 282         plst = list(tparam.items()) + [('eval_metric', itm) for itm in evals]
    283         ret.append(CVPack(dtrain, dtest, plst))
    284     return ret
AttributeError: 'list' object has no attribute 'items'

Mat*_*ury 6

Here is the signature of xgboost.cv, copied from the documentation:

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False,
    folds=None, metrics=(), obj=None, feval=None, maximize=False,
    early_stopping_rounds=None, fpreproc=None, as_pandas=True,
    verbose_eval=None, show_stdv=True, seed=0, callbacks=None)

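Note that exactly two parameters have no default value (params and dtrain), so they are the ones normally passed by position, and that the parameter in the fourth position is nfold.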

Your call is:

xgb.cv(plst, dtrain, num_round, evallist, nfold=5, early_stopping_rounds=5) 

When Python resolves a function call, it first matches all the arguments you passed by position. So in your case, Python matches them like this:

Formal Parameter <-- What You Passed In
          params <-- plst
          dtrain <-- dtrain
 num_boost_round <-- num_round
           nfold <-- evallist

Then Python matches, by name, all the arguments you passed as keywords. So in your case, Python matches them like this:

     Formal Parameter <-- What You Passed In
                nfold <-- 5
early_stopping_rounds <-- 5

So you can see that the formal parameter nfold is assigned twice, which is what produces this:

TypeError: cv() got multiple values for keyword argument 'nfold'

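Probably the simplest and clearest fix is to pass all the arguments as keywords. In general, it is good practice to keep the number of positional arguments very small; most programmers seem to aim for about two at most.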
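
The binding rules described above can be tried without xgboost installed. The following is a minimal sketch, assuming a hypothetical stub named cv that only copies the leading parameters of the xgboost.cv signature; it is not the library function, and the placeholder values for plst, dtrain and evallist are made up for illustration.

```python
# Hypothetical stand-in that mirrors the first parameters of xgboost.cv.
# It is NOT the real function; it only echoes what was bound to each name.
def cv(params, dtrain, num_boost_round=10, nfold=3, early_stopping_rounds=None):
    return {
        "num_boost_round": num_boost_round,
        "nfold": nfold,
        "early_stopping_rounds": early_stopping_rounds,
    }

# Placeholder values (assumptions for illustration only).
plst, dtrain, num_round, evallist = {}, object(), 100, []

# Case 1: the fourth positional value (evallist) is bound to nfold,
# and the keyword nfold=5 then collides with it.
try:
    cv(plst, dtrain, num_round, evallist, nfold=5, early_stopping_rounds=5)
    collision = None
except TypeError as err:
    collision = str(err)
print(collision)

# Case 2: a positional argument written after a keyword argument is rejected
# when the source is compiled, before anything runs.
try:
    compile("cv(plst, dtrain, num_round, nfold=5, evallist)", "<example>", "eval")
    parse_error = None
except SyntaxError as err:
    parse_error = err.msg
print(parse_error)

# Case 3: every argument passed by keyword, with evallist dropped.
result = cv(params=plst, dtrain=dtrain, num_boost_round=num_round,
            nfold=5, early_stopping_rounds=5)
print(result)
```

The exact wording of the two error messages differs between Python versions, so the sketch prints them rather than assuming a particular text.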

"But then I get another error, and I cannot figure it out"

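It looks like you are passing a list where a dictionary is expected. Consulting the documentation again, the first argument:

params (dict) – Booster parameters.

should be a dictionary.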
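
To see why a list fails there, here is a minimal sketch built around the line shown in the traceback from the question. The helper name merge_metrics is hypothetical and only imitates that one line; the parameter values are shortened from the question's param dict.

```python
# Shortened version of the question's parameters (values for illustration).
param = {"max_depth": 5, "eta": 0.05, "objective": "multi:softprob", "num_class": 39}

# Hypothetical helper imitating the traceback line
#   plst = list(tparam.items()) + [('eval_metric', itm) for itm in evals]
# It is not the library's own function.
def merge_metrics(tparam, evals):
    return list(tparam.items()) + [("eval_metric", itm) for itm in evals]

# What the question passes: a list of (key, value) pairs, not a dict.
plst = list(param.items())

try:
    merge_metrics(plst, ["mlogloss"])
    failure = None
except AttributeError as err:
    failure = str(err)
print(failure)

# Passing the dict itself works, because a dict has an .items() method.
merged = merge_metrics(param, ["mlogloss"])
print(merged[-1])

# With xgboost installed and the question's dtrain and num_round defined,
# the corresponding call would presumably pass the dict directly:
# res = xgb.cv(params=param, dtrain=dtrain, num_boost_round=num_round,
#              nfold=5, early_stopping_rounds=5)
```

In other words, the likely fix is to hand xgb.cv the original param dictionary rather than plst, and to store the result in an ordinary variable, since no bst object exists when xgb.train is not called.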