Koc*_*r4d · tags: python, multithreading, knn, scikit-learn
I set up a simple experiment to check how much a multi-core CPU matters when running sklearn's GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I misunderstand the benefit of multiple cores or am just not doing it right.

There is no difference in completion time between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell ran, CPU usage was ~13%, and it gradually climbed to 100% for the last cell. I expected it to finish faster. Maybe not linearly faster, i.e. 8 jobs being twice as fast as 4 jobs, but somewhat faster.
This is how I set it up:

I am using a jupyter-notebook, so "cell" below refers to a jupyter-notebook cell.

I loaded MNIST and used a test size of 0.05, which gives 3000 digit images in X_play:
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)
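As an aside, fetch_mldata has been removed from newer scikit-learn releases (fetch_openml('mnist_784') is the usual replacement). To check what the split above produces without downloading anything, here is a sketch using a synthetic stand-in for the 60000-sample training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MNIST training set: 60000 samples, 784 features,
# 10 balanced classes (avoids the download; fetch_mldata is gone in sklearn >= 0.22).
rng = np.random.RandomState(42)
X_train = rng.rand(60000, 784)
y_train = np.repeat(np.arange(10), 6000)

# Same call as in the question: keep 5% as a stratified "play" set.
_, X_play, _, y_play = train_test_split(
    X_train, y_train, test_size=0.05, random_state=42,
    stratify=y_train, shuffle=True)

print(X_play.shape)  # (3000, 784)
```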
In the next cell I set up a KNN with a GridSearchCV:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
Then I ran 8 cells, one for each of the 8 n_jobs values. My CPU is an i7-4770 with 4 cores and 8 threads.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
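The "Done 18 out of 18" in the logs below follows directly from the grid: 2 weights × 3 n_neighbors = 6 candidates, times cv=3 folds. A quick sanity check with ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 = 6 candidates
cv = 3
print(n_candidates * cv)  # 6 candidates x 3 folds = 18 fits
```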
Results:
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 2.0min finished
[Parallel(n_jobs=2)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=3)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=5)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=6)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=7)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=8)]: Done 18 out of 18 | elapsed: 1.4min finished
Second test

A RandomForestClassifier made much better use of the cores. Here the test size was 0.5, i.e. 30000 images.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
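Plugging the RandomForest timings above into a speedup/efficiency calculation makes the diminishing returns explicit:

```python
# Wall-clock minutes copied from the RandomForest logs above.
elapsed = {1: 110.9, 2: 56.8, 3: 39.3, 4: 35.3,
           5: 36.0, 6: 34.4, 7: 32.1, 8: 30.1}

# Speedup relative to the single-job run, and efficiency per worker.
for n, t in elapsed.items():
    speedup = elapsed[1] / t
    print(f"n_jobs={n}: speedup {speedup:.2f}x, efficiency {speedup / n:.0%}")
```

Scaling is near-linear up to 3 jobs on this 4-core CPU, then flattens as hyper-threads add little for compute-bound work.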
Here are some reasons that may cause this behavior:
n_jobs

Looking at the time per thread (the time GridSearchCV takes to fully train and test one model) against the total time:

- n_jobs=1 and n_jobs=2: 2.9 s per thread (total ~2 min)
- n_jobs=3: 3.4 s (total 1.4 min)
- n_jobs=4: 3.8 s (total 58 s)
- n_jobs=5: 4.2 s (total 51 s)
- n_jobs=6: 4.2 s (total ~49 s)
- n_jobs=7: 4.2 s (total ~49 s)
- n_jobs=8: 4.2 s (total ~49 s)

Now you can see that the time per thread increases, yet the overall time tends to decrease (although beyond n_jobs=4 the difference was not exactly linear) and it stays constant with n_jobs >= 6. This is because there is a cost to initializing and releasing threads. See this github issue and this question.
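That start-up and dispatch overhead is easy to observe with joblib (the library behind GridSearchCV's Parallel logs) on deliberately tiny tasks; the exact timings vary by machine, so none are claimed here:

```python
from time import perf_counter
from joblib import Parallel, delayed

def tiny_task(x):
    # A task far too small to amortize worker start-up and dispatch costs.
    return x * x

# With tasks this small, more workers need not mean less wall-clock time.
for n_jobs in (1, 2, 4):
    t0 = perf_counter()
    results = Parallel(n_jobs=n_jobs)(delayed(tiny_task)(i) for i in range(200))
    print(f"n_jobs={n_jobs}: {perf_counter() - t0:.3f}s")
```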
Then there could be other bottlenecks, such as the data being broadcast to all threads at once, thread contention over RAM (or other resources), how the data is pushed to each thread, and so on.
I also suggest you read about Amdahl's law, which gives a theoretical bound on the speedup achievable through parallelization.

[Image: Amdahl's law plot. Source: Wikipedia]
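As a rough illustration (the parallel fraction p = 0.6 below is an assumed value for demonstration, not measured from the experiment), Amdahl's law S(n) = 1 / ((1 - p) + p/n) caps the achievable speedup no matter how many cores you add:

```python
def amdahl_speedup(p, n):
    """Theoretical speedup on n workers when a fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

# Assumed p = 0.6: even with infinitely many cores the speedup
# is capped at 1 / (1 - 0.6) = 2.5x.
for n in (1, 2, 4, 8):
    print(f"n={n}: {amdahl_speedup(0.6, n):.2f}x")
```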
Finally, it may also come down to the size of the data and the complexity of the model you use for training.
Here is a blog post explaining the same issue with multithreading.