XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?

Question

Ele*_*Ant 5 python multithreading xgboost scikit-optimize bayesian-deep-learning

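I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best-performing set is found.

For a given set of hyperparameters, XGBoost takes a very long time to train a model, so to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this: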

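from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from xgboost import XGBRegressor

# xgb_params (the search space) and the X_*/y_* frames are defined elsewhere.
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state = 42,  objective='reg:squarederror', n_jobs = 1))])

xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}

# random_state only has an effect when shuffle=True (newer scikit-learn versions raise an error otherwise).
xgb_kfold = KFold(n_splits = 5, shuffle = True, random_state = 42)

xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_jobs = 2, n_points = 1, n_iter = 15, random_state = 42, verbose = 4, scoring = 'neg_mean_absolute_error', fit_params = xgb_fit_params)

xgb_unsm_cv.fit(X_train.values, y_train.values)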

However, I find that with n_jobs > 1 in the BayesSearchCV call, the fit crashes with the following error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

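This error persists whenever I use more than one thread in the BayesSearchCV call, regardless of how much memory I provide.

Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can the two packages be made to work together somehow? Without some way of multithreading the optimization, I worry that fitting my model will take weeks. What can I do to fix this?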

Answer 1

ina*_*tus 4

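I don't think the error is caused by an incompatibility between the libraries. Rather, because you are asking for two different parallel operations at once, you are running out of memory: the program tries to load the full dataset into RAM not once but twice (or more, depending on the number of workers).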

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

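The SIGKILL (-9) exit code points to the second cause named in the message: the operating system killed the worker, which is what typically happens when the system runs out of available memory. (A segmentation fault proper is an invalid memory access and would usually be reported as SIGSEGV instead.)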

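Note that XGBoost is a RAM-hungry beast, and combining it with another layer of parallelism is bound to take its toll (personally, I would not recommend doing this on a daily-driver machine).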
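To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes the data is held as a dense array of 64-bit floats and that each search worker ends up with its own copy; both are assumptions for illustration, and the real footprint (XGBoost's internal structures, the validation set, Python overhead) will be higher. The helper name `dense_array_gib` is hypothetical.

```python
def dense_array_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense n_rows x n_cols numeric array."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Figures from the question: 7 million rows, 54 features.
one_copy = dense_array_gib(7_000_000, 54)        # float64 values
two_workers = 2 * one_copy                       # assumption: one copy per search worker
as_float32 = dense_array_gib(7_000_000, 54, 4)   # same data stored as float32

print(f"one float64 copy:  {one_copy:.2f} GiB")
print(f"two worker copies: {two_workers:.2f} GiB")
print(f"one float32 copy:  {as_float32:.2f} GiB")
```

A single float64 copy already comes to roughly 2.8 GiB before any training structures are built, so a few parallel workers can plausibly exhaust the RAM of an ordinary machine.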

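The most workable solution is probably either to rent more capable hardware from Google or another cloud service (mind the cost), or to reduce the size of the dataset with some statistical technique so that it becomes tractable, such as those mentioned in a Kaggle notebook and a Data Science StackExchange post.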
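One simple way to shrink the data is to run the hyperparameter search on a random subset of rows. The sketch below only picks the row indices, using the standard library; the function name `sample_row_indices` and the 10% fraction are illustrative assumptions, not something taken from the linked posts.

```python
import random


def sample_row_indices(n_rows: int, fraction: float, seed: int = 42) -> list:
    """Return a sorted random subset of row indices, drawn without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(n_rows * fraction))   # number of rows to keep
    rng = random.Random(seed)            # local generator, reproducible
    return sorted(rng.sample(range(n_rows), k))


# Small demonstration; for the question's data n_rows would be 7_000_000.
idx = sample_row_indices(n_rows=1_000, fraction=0.1)
print(len(idx), idx[:5])
```

With pandas, the indices could then be applied as `X_train.iloc[idx]` and `y_train.iloc[idx]`. Searching on a subset and refitting the best configuration on the full data is one possible workflow; whether the subset is representative enough has to be checked for the data at hand.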

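The idea is to either upgrade the hardware (monetary cost), run BayesSearchCV single-threaded (time cost), or shrink the data using whichever technique suits you best.

In the end, the answer remains that the libraries are most likely compatible; the data is simply too large for the available RAM.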
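The trade-off between these options can be sketched as a small planning helper. It assumes that each search worker is a separate process holding its own copy of the data, while XGBoost's threads inside one process share a single copy; under that assumption the number of search workers is capped by RAM and the remaining cores go to XGBoost. The function name `plan_parallelism` and all numbers are hypothetical.

```python
def plan_parallelism(total_cores: int, ram_gib: float, per_worker_gib: float) -> dict:
    """Choose search-level and model-level n_jobs under a RAM budget."""
    # Workers that fit in RAM, at least 1 and at most one per core.
    fit_in_ram = int(ram_gib // per_worker_gib)
    search_jobs = max(1, min(total_cores, fit_in_ram))
    # Cores left for each worker's XGBoost threads.
    model_jobs = max(1, total_cores // search_jobs)
    return {"search_n_jobs": search_jobs, "model_n_jobs": model_jobs}


# Illustrative machine: 8 cores, 16 GiB RAM, about 10 GiB needed per worker.
plan = plan_parallelism(total_cores=8, ram_gib=16, per_worker_gib=10)
print(plan)
```

For this illustrative machine the plan keeps the search single-process and gives all cores to the model, which would correspond to `n_jobs = 1` in the `BayesSearchCV` call and `n_jobs = 8` in `XGBRegressor`, the reverse of the layout in the question.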
