Ash*_*tad 6 scikit-learn joblib jupyter-notebook google-colaboratory gridsearchcv
With scikit-learn 0.24.0 or later, when you use GridSearchCV or RandomizedSearchCV with n_jobs=-1, no progress messages are printed no matter what verbose value you set (1, 2, 3, or 100). With scikit-learn 0.23.2 or earlier, however, everything works as expected and joblib prints the progress messages.
Here is sample code you can use to repeat my experiment in Google Colab or a Jupyter Notebook:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10]}
svc = svm.SVC()
# With n_jobs=-1, a high verbose value should make joblib print per-fit progress
clf = GridSearchCV(svc, parameters, scoring='accuracy', refit=True, n_jobs=-1, verbose=60)
clf.fit(iris.data, iris.target)
print('Best accuracy score: %.2f' %clf.best_score_)
Result with scikit-learn 0.23.2:
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0295s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 out of 30 | elapsed: 0.0s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 3 out of 30 | elapsed: 0.0s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 4 out of 30 | elapsed: 0.0s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 5 out of 30 | elapsed: 0.0s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 6 out of 30 | elapsed: 0.0s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 8 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 9 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 10 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 11 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 12 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 13 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 14 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 15 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 16 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 17 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 18 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 19 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 20 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 21 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 22 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 23 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 24 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 25 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 26 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 27 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 28 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 0.1s finished
Best accuracy score: 0.98
Result with scikit-learn 0.24.0 (tested up to v1.0.2):
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best accuracy score: 0.98
It looks to me as if scikit-learn 0.24.0 and later do not forward the verbose value, so no progress is printed when GridSearchCV or RandomizedSearchCV runs with multiple processes on joblib's loky backend.
Any idea how to work around this in Google Colab or Jupyter Notebook and get the progress logs printed with sklearn 0.24.0 or later?
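For reference, one workaround that is sometimes suggested (the backend choice below is an assumption, not a confirmed fix) is to force joblib's in-process threading backend, so that any progress output is produced by the notebook kernel process itself rather than by loky worker processes. A minimal sketch; note that threading can be slower than loky for CPU-bound estimators because of the GIL:

# Hypothetical workaround: run the search under joblib's threading backend
# so that progress output stays inside the notebook kernel process.
# Threading may be slower than loky for CPU-bound fits because of the GIL.
from joblib import parallel_backend

with parallel_backend('threading', n_jobs=-1):
    clf.fit(iris.data, iris.target)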
Here is a roundabout way to get GridSearchCV-like behavior and print progress in Google Colab. It would need adapting to get RandomizedSearchCV-like behavior; a sketch of that adaptation follows the final code block below.
This requires creating training, validation, and test sets. We will use the validation set to compare several models, and hold back the test set for testing the final best model.
import gc
import sys  # needed for tqdm's file=sys.stdout in the method below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm

from sklearn.neighbors import KernelDensity
from scipy import stats
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, ParameterGrid

# This is based on the target and features from my dataset
y = relationships["tmrca"]
X = relationships.drop(columns = ["sample1", "sample2", "total_span_cM", "max_span_cM", "relationship", "tmrca"])

# 60% train / 20% validation / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(f"X_train size: {len(X_train):,} \nX_validation size: {len(X_validation):,} \nX_test size: {len(X_test):,}")

Here we define the method.
def random_forest_tvt(para_grid, seed):
    # Grid search over hyperparameters like n_estimators, max_leaf_nodes, etc.:
    # fit the model on the training set, tune the parameters on the
    # validation set, and keep the best parameters.
    error_min = 1
    count = 0
    clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
    num_fits = len(ParameterGrid(para_grid))
    with tqdm(total=num_fits, desc="Trying the models for the best fit...", file=sys.stdout) as fit_pbar:

        for g in ParameterGrid(para_grid):
            count += 1
            print(f"\n{g}")
            clf.set_params(**g)
            clf.fit(X_train, y_train)

            y_predict_validation = clf.predict(X_validation)
            accuracy_measure = accuracy_score(y_validation, y_predict_validation)
            error_validation = 1 - accuracy_measure
            print(f"The accuracy is {accuracy_measure * 100:.2f}%.\n")

            if error_validation < error_min:
                error_min = error_validation
                best_para = g

            fit_pbar.update()

    # Refit the model with the best parameters for the method's output
    clf.set_params(**best_para)
    clf.fit(X_train, y_train)

    y_predict_train = clf.predict(X_train)
    score_train = accuracy_score(y_train, y_predict_train)

    y_predict_validation = clf.predict(X_validation)
    score_validation = accuracy_score(y_validation, y_predict_validation)

    return best_para, score_train, score_validation

Then we define the parameter grid and call the method.
seed = 0

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 5000, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 3)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]
# Parameter grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(f"The parameter grid\n{random_grid}\n")

best_parameters, score_train, score_validation = random_forest_tvt(random_grid, seed)
print(f"\n === Random Forest ===\n Best parameters are: {best_parameters} \n training score: {score_train * 100:.2f}%, validation score: {score_validation * 100:.2f}%.")

Here are the first five fits, printed as output in Google Colab while the method is still running.
The parameter grid
{'n_estimators': [1000, 3000, 5000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 60, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True]}

Trying the models for the best fit...:   0%|          | 0/216 [00:00<?, ?it/s]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1000}
The accuracy is 85.13%.

Trying the models for the best fit...:   0%|          | 1/216 [00:16<58:27, 16.31s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 3000}
The accuracy is 85.13%.

Trying the models for the best fit...:   1%|          | 2/216 [01:05<2:06:44, 35.53s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 5000}
The accuracy is 85.10%.

Trying the models for the best fit...:   1%|▏         | 3/216 [02:40<3:42:34, 62.70s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 1000}
The accuracy is 85.15%.

Trying the models for the best fit...:   2%|▏         | 4/216 [02:56<2:36:00, 44.15s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 3000}
The accuracy is 85.14%.

Trying the models for the best fit...:   2%|▏         | 5/216 [03:43<2:39:13, 45.28s/it]

You can then use best_parameters for further fine-tuning or to call the predict method on the test set.
best_grid = RandomForestClassifier(n_jobs=-1, random_state=seed)
best_grid.set_params(**best_parameters)
best_grid.fit(X_train, y_train)
y_predict_test = best_grid.predict(X_test)
score_test = accuracy_score(y_test, y_predict_test)
print(f"{score_test * 100:.2f}%")

You would need to make further adjustments to get k-fold behavior. As written, each model is evaluated once on the training set and once on the validation set, for a total of two evaluations per model; the model with the best parameters is then evaluated a third time to produce the output. Finally, you can use the output parameters for further fine-tuning (not shown here) or call the predict method on the test set.
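As a rough illustration of the two adaptations mentioned above (a sketch, not part of the original answer): swapping ParameterGrid for ParameterSampler mimics RandomizedSearchCV's random candidate sampling, and cross_val_score replaces the single validation split with k-fold cross-validation. It reuses X_train, y_train, and random_grid from above; the helper name random_forest_kfold is hypothetical:

# Hypothetical sketch: k-fold scoring plus RandomizedSearchCV-style sampling.
import sys
import numpy as np
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler, cross_val_score

def random_forest_kfold(para_grid, seed, n_iter=20, k=5):
    best_para, best_score = None, -np.inf
    # ParameterSampler draws n_iter random candidates instead of the full grid
    candidates = list(ParameterSampler(para_grid, n_iter=n_iter, random_state=seed))
    with tqdm(total=len(candidates), desc="Trying the models...", file=sys.stdout) as pbar:
        for g in candidates:
            clf = RandomForestClassifier(n_jobs=-1, random_state=seed, **g)
            # k-fold cross-validation on the training data replaces the
            # single train/validation split used in random_forest_tvt
            score = cross_val_score(clf, X_train, y_train, cv=k, n_jobs=-1).mean()
            if score > best_score:
                best_score, best_para = score, g
            pbar.update()
    return best_para, best_score

ParameterSampler accepts the same dict of lists as random_grid and samples from it uniformly, so the grid definition itself needs no changes.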