Tit*_*llo 9
python scikit-learn cross-validation grid-search
I am trying to find the best set of parameters for an SVR model, and I would like to use GridSearchCV to search over different values of C. However, from previous tests I noticed that the train/test split strongly influences the overall performance (r2 in this case). To address this, I would like to implement repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way to do this with GridSearchCV?
Quick solution:

Following the idea presented in the official scikit-learn documentation, a quick solution looks like this:
import numpy
from sklearn.model_selection import KFold, GridSearchCV

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
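Fleshed out, the loop above runs like this. This is a minimal sketch: the toy dataset from make_regression and the grid values stand in for the question's own X, y, svr and p_grid, which are not given in the post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import SVR

# Toy regression data standing in for the question's own dataset
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
svr = SVR(kernel="rbf")
p_grid = {"C": [1, 10, 100]}  # illustrative grid

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    # A different shuffle seed per trial gives 10 independent 5-fold CVs
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)
    scores.append(clf.best_score_)  # best mean r2 of this trial

print("Average Score: {0} STD: {1}".format(np.mean(scores), np.std(scores)))
```

Note that each trial reports the best cross-validated score found by the grid search, so averaging over trials smooths out the split-to-split variability the question describes.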
Viv*_*mar 19
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also check my other answer here for a similar approach.

You can adapt the steps to suit your needs:
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# To be used within GridSearchCV (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This gives you the required 10 x 5 CVs:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
Edit - a description of nested cross-validation with cross_val_score() and GridSearchCV():

1. clf, X, y and outer_cv are passed to cross_val_score.
2. X is divided into X_outer_train and X_outer_test using outer_cv; the same happens for y.
3. X_outer_test is held back, and X_outer_train is passed to clf for fit() (GridSearchCV in our case). From here on, assume X_outer_train is called X_inner, since it is passed to the inner estimator, and that y_outer_train is y_inner.
4. Inside GridSearchCV, X_inner is split into X_inner_train and X_inner_test using inner_cv; the same happens for y.
5. The inner estimator is trained on X_inner_train and y_inner_train and scored on X_inner_test and y_inner_test.
6. The hyper-parameters that score best over the inner splits (X_inner_train, X_inner_test) are passed to clf.best_estimator_, which is then fitted on all of the data, i.e. X_outer_train.
7. This clf (gridsearch.best_estimator_) is then scored using X_outer_test and y_outer_test.
8. cross_val_score repeats this for every outer split and returns the array of scores, whose mean is the nested_score.
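The outer loop that cross_val_score performs around GridSearchCV can be sketched by hand, which makes the mechanics above concrete. The iris dataset and grid values are illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Outer split: the test fold is held back from the grid search
    X_outer_train, X_outer_test = X[train_idx], X[test_idx]
    y_outer_train, y_outer_test = y[train_idx], y[test_idx]

    # GridSearchCV runs the inner 5-fold CV on the outer training fold,
    # then refits best_estimator_ on that whole fold
    clf = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100]}, cv=inner_cv)
    clf.fit(X_outer_train, y_outer_train)

    # Score the refitted best estimator on the held-back outer fold
    outer_scores.append(clf.score(X_outer_test, y_outer_test))

# The mean over the 10 outer folds is the nested score
nested_score = np.mean(outer_scores)
```

This hand-rolled loop should agree with cross_val_score(clf, X, y, cv=outer_cv).mean() up to the randomness of the splits.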
.Ada*_*mRH 13
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold; otherwise it uses KFold. But you can supply your own. In your case it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Define svr here
...

# Specify the cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
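As a runnable sketch of this approach: the toy dataset and grid values below are assumptions filled in for illustration, replacing the question's own svr, p_grid and data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Toy data; the real X, y and grid come from your own problem
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
svr = SVR(kernel="rbf")
p_grid = {"C": [1, 10, 100]}

# 10 repetitions of 5-fold CV -> 50 fits per parameter candidate
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
clf.fit(X, y)

print(clf.best_params_, clf.best_score_)
```

Compared with the manual loop in the first answer, this keeps everything inside one GridSearchCV call: best_score_ is already the mean over all 50 splits, so no extra averaging is needed.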