GridSearchCV - XGBoost - Early Stopping

ayy*_*mbo 17 regression python-3.x scikit-learn xgboost data-science

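I am trying to do a hyperparameter search on XGBoost using scikit-learn's GridSearchCV. During the grid search I would like it to stop early, since that reduces the search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.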

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose ,fit_params={'early_stopping_rounds':42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX,trainY)

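I tried to pass the early stopping parameter via fit_params, but then it throws the error below, which is basically because the validation set required for early stopping is missing: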

/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189 
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']

How can I apply GridSearch on XGBoost with early_stopping_rounds?

Note: the model works without GridSearch, and GridSearch also works without fit_params={'early_stopping_rounds':42}.

小智 14

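When using early_stopping_rounds you also have to pass eval_metric and eval_set as input parameters to the fit method. Early stopping is done by calculating the error on an evaluation set. The error has to decrease at least once every early_stopping_rounds rounds; otherwise the generation of additional trees is stopped early.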
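
To make the stopping rule concrete, here is a minimal pure-Python sketch of the patience logic described above. It is not XGBoost's actual implementation; the function name and the list of validation errors are made up for illustration, and it assumes a lower error is better.

```python
def best_round_with_early_stopping(val_errors, early_stopping_rounds):
    """Return (best_round, last_round_run) for per-round validation errors."""
    best_error = float("inf")
    best_round = -1
    for round_idx, error in enumerate(val_errors):
        if error < best_error:
            # The validation error improved: remember this round as the best so far.
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds: stop boosting.
            return best_round, round_idx
    # Ran through every round without triggering the stop.
    return best_round, len(val_errors) - 1


# Hypothetical validation errors, one per boosting round.
errors = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
best, stopped = best_round_with_early_stopping(errors, early_stopping_rounds=3)
print(best, stopped)  # → 2 5
```

Note that the last value (3.4) is never reached, which is exactly why an empty evaluation set breaks things: without validation errors there is nothing to compare.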

For details, see the documentation of xgboost's fit method.

Here is a minimal, fully working example:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()
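# NOTE: fit_params as a constructor argument only works on older
# scikit-learn releases (it was removed in 0.21).
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         fit_params=fit_params,
         cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))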
gridsearch.fit(trainX,trainY)

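  • Thanks for your answer, it does work. But giving a predefined eval_set goes against the nature of cross-validation, I guess. (4 upvotes)
  • @glao: The eval set should be the held-out set of the cross-validation procedure for everything to work as expected. (4 upvotes)
  • Using "fit_params" is no longer recommended, since it is going to be deprecated. (2 upvotes)
  • Thanks @MichaelM, how exactly should we do this? Any help? (2 upvotes)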

emi*_*459 14

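An update to @glao's answer and a reply to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):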

import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))

gridsearch.fit(trainX, trainY, **fit_params)
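
A side note on the cv argument used in the code above: get_n_splits() returns a plain integer, so GridSearchCV receives cv=2 and falls back to its default K-fold splitting for a regressor rather than time-ordered splits. The small sketch below (toy data, assumed for illustration) shows the difference between the integer and the splitter object itself.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # toy data: 10 ordered samples
tscv = TimeSeriesSplit(n_splits=2)

# get_n_splits() only reports the number of splits, not the splits themselves.
n = tscv.get_n_splits(X)
print(n)  # → 2

# The splitter object yields time-ordered (train, test) index pairs.
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    print(train, test)
```

So if time-ordered folds are what you want, passing cv=TimeSeriesSplit(n_splits=cv) directly should do it. With the five-sample toy data from the answer each test fold would then hold a single sample, so a larger dataset is needed for the default R² scoring to be meaningful.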


Jak*_*rew 5

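Here is a solution that works with GridSearchCV inside a Pipeline. The challenge arises when you have a pipeline that is needed to pre-process your training data. For example, when X is a set of text documents, you need a TfidfVectorizer to vectorize it.

Override the .fit() function of XGBRegressor or XGBClassifier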

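  • This step uses train_test_split() to select the specified number of validation records from X for the eval_set, and then passes the remaining records along to fit().
  • A new parameter, eval_test_size, is added to .fit() to control the number of validation records. (See the train_test_split test_size documentation.)
  • **kwargs passes along any other arguments the user adds for the XGBRegressor.fit() function.

from xgboost.sklearn import XGBRegressor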
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):

        # Default: train on all records when no validation split is requested
        # (otherwise X_train / y_train would be undefined below).
        X_train, y_train = X, y

        if eval_test_size is not None:

            # Hold out part of the data as the early-stopping validation set.
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=self.random_state)

            eval_set = [(X_test, y_test)]

            # Could add (X_train, y_train) to eval_set
            # to get .evals_result() for both train and test
            # eval_set = [(X_train, y_train), (X_test, y_test)]

            kwargs['eval_set'] = eval_set

        return super().fit(X_train, y_train, **kwargs)

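Example usage

Below is a multi-step pipeline that includes several transformations of X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above in the form xgbr__eval_test_size=200. In this example:

  • X_train contains the text documents passed to the pipeline.
  • XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train as the validation set for early stopping. (This could also be a fraction, e.g. xgbr__eval_test_size=0.2.)
  • The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
  • Early stopping can now occur after 75 boosting rounds without improvement, for each cv fold in the grid search.

from sklearn.pipeline import Pipeline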
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
   
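xgbr_pipe = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('vt', VarianceThreshold()),
    # TfidfVectorizer yields a sparse matrix, which StandardScaler cannot center
    ('scaler', StandardScaler(with_mean=False)),
    # score features with f_regression (the default, f_classif, targets classification)
    ('Sp', SelectPercentile(score_func=f_regression)),
    ('xgbr', XGBRegressor_ES(n_estimators=2000,
                             objective='reg:squarederror',
                             eval_metric='mae',
                             learning_rate=0.0001,
                             random_state=7))])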

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Example of fitting the pipeline:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)


Example of fitting GridSearchCV:

from sklearn.model_selection import GridSearchCV

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)
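
If the xgbr__ prefix looks like magic, the routing can be seen in isolation with a tiny stand-in estimator. This is only a sketch with scikit-learn's default settings (metadata routing not enabled); RecordingRegressor is a made-up class that does nothing except store the keyword arguments its fit() receives.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for XGBRegressor_ES: it only records its fit kwargs."""

    def fit(self, X, y, **fit_kwargs):
        self.fit_kwargs_ = dict(fit_kwargs)
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Predict the training mean for every row.
        return np.full(len(X), self.mean_)


pipe = Pipeline(steps=[("scaler", StandardScaler()), ("reg", RecordingRegressor())])
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# "reg__eval_test_size" is split at the double underscore:
# the step named "reg" receives eval_test_size=3 in its fit().
pipe.fit(X, y, reg__eval_test_size=3)
received = pipe.named_steps["reg"].fit_kwargs_
print(received)  # → {'eval_test_size': 3}
```

GridSearchCV.fit() forwards such keyword arguments to the pipeline's fit() for every candidate and fold, which is how eval_test_size reaches the wrapped regressor in the grid search above.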