Bayesian optimization with CatBoost

prp*_*prp 6 python bayesian python-3.x pandas catboost

This is my attempt at applying BayesSearchCV to CatBoost:

from catboost import CatBoostClassifier
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold


# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator=CatBoostClassifier(
        silent=True
    ),
    search_spaces={
        'depth': (2, 16),
        'l2_leaf_reg': (1, 500),
        'bagging_temperature': (1e-9, 1000, 'log-uniform'),
        'border_count': (1, 255),
        'rsm': (0.01, 1.0, 'uniform'),
        'random_strength': (1e-9, 10, 'log-uniform'),
        'scale_pos_weight': (0.01, 1.0, 'uniform'),
    },
    scoring='roc_auc',
    cv=StratifiedKFold(
        n_splits=2,
        shuffle=True,
        random_state=72
    ),
    n_jobs=1,
    n_iter=100,
    verbose=1,
    refit=True,
    random_state=72
)

Tracking the results:

import numpy as np
import pandas as pd


def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search."""

    # Get all the models tested so far in DataFrame format
    all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)

    # Get current parameters and the best parameters
    best_params = pd.Series(bayes_cv_tuner.best_params_)
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(bayes_cv_tuner.best_score_, 4),
        bayes_cv_tuner.best_params_
    ))

Fitting the Bayesian CV:

resultCAT = bayes_cv_tuner.fit(X_train, y_train, callback=status_print)

Result

The first 3 iterations work fine, but then I get an unbroken stream of:

Iteration with suspicious time 7.55 sec ignored in overall statistics.

Iteration with suspicious time 739 sec ignored in overall statistics.

(...)

Any idea where I'm going wrong / how I can improve this?

Regards,

Luc*_*ron 4

One of the iterations in the set of experiments skopt is arranging is taking too long to complete, relative to the timings CatBoost has recorded so far, so CatBoost drops it from its overall time statistics and prints that warning.

If you raise the verbosity of the classifier and use a callback to see which parameter combination skopt is exploring when this happens, you will most likely find that the culprit is the depth parameter: skopt slows down whenever CatBoost is testing deeper trees.
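
For example, a minimal sketch (my own illustration, not from the original question): drop silent=True and give CatBoost a verbose interval so it prints per-iteration progress and timings, which makes it easy to spot the slow parameter combinations. The reduced search space and iteration count here are only for the illustration.

bayes_cv_tuner = BayesSearchCV(
    estimator=CatBoostClassifier(verbose=200),   # print progress every 200 boosting iterations
    search_spaces={'depth': (2, 16)},            # reduced space, just for this debugging run
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=72),
    n_iter=10,
    random_state=72
)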

You can also try to debug using this custom callback:

import joblib

counter = 0

def onstep(res):
    global counter
    args = res.x            # best parameter vector found so far
    x0 = res.x_iters        # all parameter vectors evaluated so far
    y0 = res.func_vals      # corresponding objective values
    print('Last eval: ', x0[-1],
          ' - Score ', y0[-1])
    print('Current iter: ', counter,
          ' - Score ', res.fun,
          ' - Args: ', args)
    # Persist the evaluated points so an interrupted search can be inspected later
    joblib.dump((x0, y0), 'checkpoint.pkl')
    counter += 1

You can call it by:

resultCAT = bayes_cv_tuner.fit(X_train, y_train, callback=[onstep, status_print])
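
As a side note (my own sketch, not part of the original answer), the checkpoint.pkl file written by onstep can be reloaded to see which parameter vectors have already been evaluated and how they scored, for instance after an interrupted run:

import joblib

# Reload the points saved by onstep(): x0 are the evaluated parameter vectors,
# y0 the objective values (skopt minimizes, so these are typically the negated CV scores).
x0, y0 = joblib.load('checkpoint.pkl')
for params, value in zip(x0, y0):
    print(params, '->', value)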

I've actually noticed the same problem in my own experiments: the complexity grows non-linearly as the depth increases, so CatBoost takes longer and longer to complete its iterations. A simple solution is to search a smaller space:

'depth':(2, 8)

Usually a depth of 8 is enough; in any case, you can first run skopt with the maximum depth set to 8 and then iterate again with a larger maximum.
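
A quick sketch of that idea (assuming the same tuner configuration as in the question, only with the depth range capped at 8):

bayes_cv_tuner = BayesSearchCV(
    estimator=CatBoostClassifier(silent=True),
    search_spaces={
        'depth': (2, 8),   # capped at 8; widen later only if the best trials sit at the bound
        'l2_leaf_reg': (1, 500),
        'bagging_temperature': (1e-9, 1000, 'log-uniform'),
        'border_count': (1, 255),
        'rsm': (0.01, 1.0, 'uniform'),
        'random_strength': (1e-9, 10, 'log-uniform'),
        'scale_pos_weight': (0.01, 1.0, 'uniform'),
    },
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=72),
    n_jobs=1,
    n_iter=100,
    refit=True,
    random_state=72
)

resultCAT = bayes_cv_tuner.fit(X_train, y_train, callback=[onstep, status_print])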