GridSearchCV extremely slow on a small dataset in scikit-learn

Ame*_*ina 6 python numpy scikit-learn

This is odd. I can run the example grid_search_digits.py successfully, but I cannot get a grid search to finish on my own data.

I have the following setup:

import sklearn
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import LeaveOneOut
from sklearn.metrics import auc_score

# ... Build X and y ....

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

loo = LeaveOneOut(len(y))
clf = GridSearchCV(SVC(C=1), tuned_parameters, score_func=auc_score)
clf.fit(X, y, cv=loo)
....
print clf.best_estimator_
....

But I never get past clf.fit (I left it running for about an hour).

I have also tried

clf.fit(X, y, cv=10)

from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y,2)
clf.fit(X, y, cv=skf)

and had the same problem (the clf.fit statement never finishes). My data is simple:

> X.shape
(27,26)

> y.shape
27

> numpy.sum(y)
5

> y.dtype
dtype('int64')


>?y
Type:       ndarray
String Form:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
Length:     27
File:       /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring:  <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

> ?X
Type:       ndarray
String Form:
       [[ -3.61238468e+03  -3.61253920e+03  -3.61290196e+03  -3.61326679e+03
           7.84590361e+02   0.0000 <...> 0000e+00   2.22389150e+00   2.53252959e+00 
           2.11606216e+00  -1.99613432e+05  -1.99564828e+05]]
Length:     27
File:       /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring:  <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

This is with the latest version of scikit-learn (0.13.1) and:

$ pip freeze
Cython==0.19.1
PIL==1.1.7
PyXB==1.2.2
PyYAML==3.10
argparse==1.2.1
distribute==0.6.34
epc==0.0.5
ipython==0.13.2
jedi==0.6.0
matplotlib==1.3.x
nltk==2.0.4
nose==1.3.0
numexpr==2.1
numpy==1.7.1
pandas==0.11.0
pyparsing==1.5.7
python-dateutil==2.1
pytz==2013b
rpy2==2.3.1
scikit-learn==0.13.1
scipy==0.12.0
sexpdata==0.0.3
six==1.3.0
stemming==1.0.1
-e git+https://github.com/PyTables/PyTables.git@df7b20444b0737cf34686b5d88b4e674ec85575b#egg=tables-dev
tornado==3.0.1
wsgiref==0.1.2

Oddly, fitting a single SVM is extremely fast:

>  %timeit clf2 = svm.SVC(); clf2.fit(X,y)                                                                                                             
1000 loops, best of 3: 328 us per loop

Update

I have noticed that if I pre-scale the data:

from sklearn import preprocessing
X = preprocessing.scale(X) 

the grid search is extremely fast.

Why? Why is GridSearchCV so sensitive to scaling when a regular svm.SVC().fit is not?

use*_*197 7

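As already noted, preprocessing is a must for SVM-based classifiers (as y == np.int*); otherwise the ML estimator's predictive power is lost to the influence of skewed features on the decision function.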
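To make the point about skewed features concrete, here is a minimal numpy sketch. The two-feature toy data and the helper name `feature_share` are made up for illustration and do not come from the question. It only measures how much each feature contributes to the pairwise squared distances that an RBF kernel is built on, before and after standardization.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: feature 0 lives on a ~1e5 scale, feature 1 on a ~1 scale.
X = np.column_stack([
    rng.normal(loc=-2e5, scale=1e3, size=27),
    rng.normal(loc=2.0, scale=0.5, size=27),
])

def feature_share(X):
    """Fraction of the total pairwise squared distance contributed by each feature."""
    diffs = X[:, None, :] - X[None, :, :]         # shape (M, M, N)
    per_feature = (diffs ** 2).sum(axis=(0, 1))   # shape (N,)
    return per_feature / per_feature.sum()

# Standardize each column to zero mean and unit variance
# (the same transformation preprocessing.scale applies by default).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

share_raw = feature_share(X)
share_scaled = feature_share(X_scaled)
print("raw:   ", share_raw)
print("scaled:", share_scaled)
```

On the raw data almost all of the distance comes from the large-scale feature, so the small-scale one is effectively invisible to the kernel; after standardization both features contribute equally.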

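As for the processing time:

  • try to get a better view of your AI/ML model's overfit/generalisation [C, gamma] landscape
  • try adding verbosity to the initial AI/ML process tuning
  • try adding n_jobs to the number crunching
  • try adding grid computing to your computation approach if the scale requires it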


aGrid = aML_GS.GridSearchCV( aClassifierOBJECT, param_grid = aGrid_of_parameters, cv = cv, n_jobs = n_JobsOnMultiCpuCores, verbose = 5 )
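For readers who want something runnable, here is a sketch of the same kind of call. It assumes a recent scikit-learn, where GridSearchCV lives in sklearn.model_selection and the metric is passed as scoring="roc_auc" rather than score_func; the data is a synthetic stand-in with the same shape as the asker's, and the step names "scale" and "svc" are arbitrary. Scaling is placed inside a Pipeline so it is re-fit on every training fold.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data: 27 samples, 26 features, 5 positives.
rng = np.random.RandomState(0)
X = rng.normal(size=(27, 26)) * 1e3
y = np.array([0] * 22 + [1] * 5)
X[y == 1] += 500.0  # hypothetical class shift so there is something to learn

# The scaler is part of the estimator, so each CV split scales its own training data.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Same grid as in the question, with the "svc__" prefix required by the pipeline.
param_grid = [
    {"svc__kernel": ["rbf"], "svc__gamma": [1e-3, 1e-4], "svc__C": [1, 10, 100, 1000]},
    {"svc__kernel": ["linear"], "svc__C": [1, 10, 100, 1000]},
]

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5),
    n_jobs=1,    # raise this to spread the fits over more CPU cores
    verbose=1,   # higher values print progress for each individual fit
)
search.fit(X, y)
print(search.best_params_)
```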

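Sometimes GridSearchCV() can indeed take a huge amount of CPU time and CPU pool resources, even after all of the tips above have been applied.

So keep calm and do not panic, provided you are sure that the feature engineering, data sanity checks, and feature-domain preprocessing were done correctly.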

[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 -62.7min
[GridSearchCV] C=16777216.0, gamma=0.5 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 -64.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 -97.3min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 -66.3min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 -13.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 - 4.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 -14.7min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 -  25.4s
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 -  44.9s
[Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished
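The total on the last log line is the number of parameter combinations times the number of cross-validation folds. Each (C, gamma) pair shows up about three times in the excerpt, which would be consistent with 900 combinations on 3 folds, though the log does not state that directly. Here is a small sketch of the arithmetic applied to the grid from the question; `count_fits` is a hypothetical helper, not a scikit-learn function.

```python
def count_fits(param_grid, n_folds):
    """Number of model fits a grid search performs, ignoring the final refit."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    n_candidates = 0
    for grid in grids:
        combos = 1
        for values in grid.values():
            combos *= len(values)   # every value list multiplies the combinations
        n_candidates += combos      # separate dicts in the list simply add up
    return n_candidates * n_folds

# Grid copied from the question: 8 rbf candidates plus 4 linear candidates.
tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

print(count_fits(tuned_parameters, n_folds=27))  # leave-one-out on 27 samples -> 324
print(count_fits(tuned_parameters, n_folds=10))  # cv=10 -> 120
print(count_fits(tuned_parameters, n_folds=2))   # StratifiedKFold(y, 2) -> 24
```

So the asker's search is only a few hundred fits; if a single fit takes microseconds on default parameters, an hour-long run points at some individual fits being very slow rather than at the number of fits.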

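As for the "... regular svm.SVC().fit" mentioned above, note that it uses the default [C, gamma] values and is therefore not representative of how your model behaves on your problem domain.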

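Re: Update

Yes indeed, normalisation/scaling of the SVM inputs is a mandatory task for this AI/ML tool. scikit-learn has good tooling to produce and reuse aScalerOBJECT, both for a-priori scaling (before aDataSET goes into .fit()) and for ex-post ad-hoc scaling, when you need to rescale a new example and send it to the predictor: anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) )

(Yes, aNewExampleX may be a matrix, which asks for "vectorised" processing of several examples at once.)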
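A minimal sketch of that fit-once, reuse-later pattern, assuming StandardScaler as the scaler object and synthetic placeholder data (the variable names and the C and gamma values here are illustrative, not the answerer's):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.normal(size=(27, 26)) * 1e3   # synthetic placeholder data
y_train = np.array([0] * 22 + [1] * 5)

# A-priori scaling: learn the per-feature mean and variance on the training data only.
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=10, gamma=1e-3).fit(scaler.transform(X_train), y_train)

# Ex-post scaling: reuse the same fitted scaler on new examples before predicting.
X_new = rng.normal(size=(3, 26)) * 1e3      # three new rows, handled in one call
predictions = clf.predict(scaler.transform(X_new))
print(predictions)
```

The important detail is that the scaler is fitted once on the training data and only `transform` is called on new examples, so they are mapped with the same mean and variance the classifier was trained on.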

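Performance relief of the O(M^2 . N^1) computational complexity

Contrary to the guess posted below, that the problem "width", measured as N == the number of SVM features in matrix X, is to blame for the overall computing time, an SVM classifier with an rbf kernel is by design an O(M^2 . N^1) problem, where M is the number of observations (examples).

So there is a quadratic dependence on the total number of observations (examples) that go into the training (.fit()) or cross-validation phase, and one can hardly claim that a supervised-learning classifier will gain predictive power by "reducing" the (only linear) "width" of the features, which themselves carry the inputs that the SVM classifier's predictive power is built on, can one?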
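One way to picture the M^2 . N term is the dense RBF kernel matrix: it has one entry per pair of observations, and each entry needs a pass over all N features. The numpy sketch below illustrates only that part, on random placeholder data; the helper names `rbf_gram` and `kernel_cost` are made up here, and the iterations of the actual SVM solver are not modelled.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Dense RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (X ** 2).sum(axis=1)
    # M*M pairwise squared distances, each one an inner product over N features.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by rounding
    return np.exp(-gamma * sq_dists)

def kernel_cost(M, N):
    """Rough multiply-add count for the full matrix: M^2 pairs times N features."""
    return M * M * N

rng = np.random.RandomState(0)
X = rng.normal(size=(27, 26))   # same shape as the data in the question
K = rbf_gram(X, gamma=1e-3)

print(K.shape)              # (27, 27): one entry per pair of observations
print(kernel_cost(27, 26))  # 18954
print(kernel_cost(54, 26))  # 75816, doubling M quadruples the cost
print(kernel_cost(27, 52))  # 37908, doubling N only doubles it
```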