GridSearchCV and RandomizedSearchCV in scikit-learn 0.24.0 or higher do not print progress logs with n_jobs=-1

Ash*_*tad 6 · tags: scikit-learn, joblib, jupyter-notebook, google-colaboratory, gridsearchcv

In scikit-learn 0.24.0 or higher, when you use GridSearchCV or RandomizedSearchCV with n_jobs=-1, no progress messages are printed regardless of the verbose value you set (1, 2, 3, or even 100). With scikit-learn 0.23.2 or lower, however, everything works as expected and joblib prints the progress messages.

Below is sample code you can use to reproduce my experiment in Google Colab or a Jupyter Notebook:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10]}
svc = svm.SVC()

clf = GridSearchCV(svc, parameters, scoring='accuracy', refit=True, n_jobs=-1, verbose=60)
clf.fit(iris.data, iris.target)
print('Best accuracy score: %.2f' %clf.best_score_)

Result with scikit-learn 0.23.2:

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0295s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   2 out of  30 | elapsed:    0.0s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   3 out of  30 | elapsed:    0.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done   4 out of  30 | elapsed:    0.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done   5 out of  30 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   6 out of  30 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   7 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   8 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   9 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  11 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  12 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  13 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  14 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  15 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  16 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  17 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  18 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  21 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  22 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  24 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  25 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  26 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  28 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s finished
Best accuracy score: 0.98

Result with scikit-learn 0.24.0 (tested up to v1.0.2):

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best accuracy score: 0.98

It seems to me that scikit-learn 0.24.0 and later do not pass the verbose value through, so no progress is printed when GridSearchCV or RandomizedSearchCV uses multiprocessing with joblib's loky backend.
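One quick way to check where the regression lies is to drive joblib's Parallel directly with a verbose value. This is a minimal sketch (square is just a placeholder workload I made up): if these progress lines still print, joblib itself is fine and the verbose value is being lost somewhere on the scikit-learn side.

# Minimal check: call joblib directly with the default loky backend.
# If these "[Parallel(n_jobs=-1)]: ..." lines print, joblib still works
# and the problem is in how scikit-learn forwards `verbose`.
# square() is only a dummy workload for illustration.
from joblib import Parallel, delayed

def square(x):
    return x * x

results = Parallel(n_jobs=-1, verbose=10)(delayed(square)(i) for i in range(30))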

Any idea how to work around this in Google Colab or a Jupyter Notebook and get progress logs printed with sklearn 0.24.0 or higher?

Answered by LaK*_*vid 1

Here is a roundabout way to reproduce GridSearchCV's behavior and print progress in Google Colab. It would need to be adapted to mimic RandomizedSearchCV's behavior; see the sketch right after this paragraph.
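For the RandomizedSearchCV-style adaptation, one option (my own sketch, not part of the original answer) is to swap ParameterGrid for scikit-learn's ParameterSampler, which draws a fixed number of random candidates from the same grid; n_iter=20 is an arbitrary choice:

# Sketch of a RandomizedSearchCV-style variant: ParameterSampler draws
# n_iter random candidates instead of enumerating the full grid. It uses
# the random_grid and seed defined later in this answer.
from sklearn.model_selection import ParameterSampler

candidates = list(ParameterSampler(random_grid, n_iter=20, random_state=seed))
for g in candidates:
    # ...same loop body as in random_forest_tvt below...
    pass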


This requires creating training, validation, and test sets. We use the validation set to evaluate the candidate models and hold back the test set for the final, best model.

import gc
import sys  # needed for the tqdm progress bar below (file=sys.stdout)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm

from sklearn.neighbors import KernelDensity
from scipy import stats
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, ParameterGrid

# This is based on the target and features from my dataset
y = relationships["tmrca"]
X = relationships.drop(columns = ["sample1", "sample2", "total_span_cM", "max_span_cM", "relationship", "tmrca"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(f"X_train size: {len(X_train):,} \nX_validation size: {len(X_validation):,} \nX_test size: {len(X_test):,}")

Here we define the method.

def random_forest_tvt(para_grid, seed):
    # grid search for the hyperparameters like n_estimators, max_leaf_nodes, etc.
    # fit model on training set, tune parameters on validation set, save best parameters
    error_min = 1
    count = 0
    clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
    num_fits = len(ParameterGrid(para_grid))
    with tqdm(total=num_fits, desc="Trying the models for the best fit...", file=sys.stdout) as fit_pbar:

        for g in ParameterGrid(para_grid):
            count += 1
            print(f"\n{g}")
            clf.set_params(**g)
            clf.fit(X_train, y_train)

            y_predict_validation = clf.predict(X_validation)
            accuracy_measure = accuracy_score(y_validation, y_predict_validation)
            error_validation = 1 - accuracy_measure
            print(f"The accuracy is {accuracy_measure * 100:.2f}%.\n")

            if error_validation < error_min:
                error_min = error_validation
                best_para = g

            fit_pbar.update()

    # fitting the model on the best parameters for method output
    clf.set_params(**best_para)
    clf.fit(X_train, y_train)

    y_predict_train = clf.predict(X_train)
    score_train = accuracy_score(y_train, y_predict_train)

    y_predict_validation = clf.predict(X_validation)
    score_validation = accuracy_score(y_validation, y_predict_validation)

    return best_para, score_train, score_validation

Then we define the parameter grid and call the method.

seed = 0

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 5000, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 3)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]
# Parameter Grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(f"The parameter grid\n{random_grid}\n")

best_parameters, score_train, score_validation = random_forest_tvt(random_grid, seed)
print(f"\n === Random Forest ===\n Best parameters are: {best_parameters} \n training score: {score_train * 100:.2f}%, validation score: {score_validation * 100:.2f}%.")

Here are the first 5 fit results printed as output in Google Colab while the method is still running.

The parameter grid
{'n_estimators': [1000, 3000, 5000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 60, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True]}

Trying the models for the best fit...:   0%|          | 0/216 [00:00<?, ?it/s]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1000}
The accuracy is 85.13%.

Trying the models for the best fit...:   0%|          | 1/216 [00:16<58:27, 16.31s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 3000}
The accuracy is 85.13%.

Trying the models for the best fit...:   1%|          | 2/216 [01:05<2:06:44, 35.53s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 5000}
The accuracy is 85.10%.

Trying the models for the best fit...:   1%|▏         | 3/216 [02:40<3:42:34, 62.70s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 1000}
The accuracy is 85.15%.

Trying the models for the best fit...:   2%|▏         | 4/216 [02:56<2:36:00, 44.15s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 3000}
The accuracy is 85.14%.

Trying the models for the best fit...:   2%|▏         | 5/216 [03:43<2:39:13, 45.28s/it]

You can then use best_parameters for further fine-tuning or to call the predict method on the test set.

best_grid = RandomForestClassifier(n_jobs=-1, random_state=seed)
best_grid.set_params(**best_parameters)
best_grid.fit(X_train, y_train)
y_predict_test = best_grid.predict(X_test)
# accuracy_score returns a fraction in [0, 1], so scale it to a percentage
score_test = accuracy_score(y_test, y_predict_test)
print(f"{score_test * 100:.2f}%")

Further adjustments would be needed to make this perform k-fold behavior (a sketch of one possible adaptation follows below). As it stands, each model is tested once on the training set and once on the validation set, two evaluations per model in total; the model with the best parameters is then tested a third time to produce the method's output. Finally, you can use the returned parameters for further fine-tuning (not shown here) or call the predict method on the test set.
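As a rough sketch of that k-fold adaptation (my own assumption of one way to do it, not part of the original answer), the selection loop could score each candidate with cross_val_score instead of a single validation split; cv=5 is an arbitrary choice, and X_train, y_train, random_grid, and seed come from the snippets above:

# Sketch of a k-fold variant of the selection loop: each candidate is
# scored by 5-fold cross-validation on the training data rather than a
# single validation split.
import sys
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

best_para, best_cv_score = None, 0.0
clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
for g in tqdm(ParameterGrid(random_grid), file=sys.stdout):
    clf.set_params(**g)
    cv_score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    if cv_score > best_cv_score:
        best_cv_score, best_para = cv_score, g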
