Model help using Scikit-learn when using GridSearch

Question

nav*_*hai 3 python machine-learning scikit-learn cross-validation grid-search

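As part of the Enron project, I built the model shown below; a summary of the steps follows the code.

The following version gives nearly perfect scores: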

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fitted with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # best_estimator_ was already refit on the full dataset, x_test included
    pred = gcv.best_estimator_.predict(x_test)


The version below gives more reasonable, but lower, scores:

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fitted with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # refit the best estimator on the training part of this split only
    gcv.best_estimator_.fit(x_train, y_train)
    pred = gcv.best_estimator_.predict(x_test)


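Used KBest to compute scores and rank the features, then tried combinations of higher- and lower-scoring features.
Used an SVM with GridSearch and a StratifiedShuffleSplit.
Used best_estimator_ to predict and to compute precision and recall.

The problem is that the estimator spits out perfect scores, in some cases 1.

But when I refit the best classifier on the training data and then run the test, I get reasonable scores.

My doubt/question is what GridSearch does with the test data after it splits using the shuffle-split object we pass to it. I assumed it would not fit anything on the test data; if that were true, then predicting on the same test data should not give such high scores. Since I set random_state, the shuffle split should produce the same splits for both the grid fit and the prediction.

So, is it wrong to use the same shuffle split for both?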

Answer 1

Viv*_*mar 8

GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator for the given data. Here is a description of what GridSearchCV does:

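1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
2. gcv.fit(features, labels)
3. clf_params is expanded into every possible parameter combination using ParameterGrid.
4. features is split into features_train and features_test using cv. The same is done for labels.
5. The grid search estimator (pipe) is trained on features_train and labels_train and scored on features_test and labels_test.
6. For each parameter combination from step 3, steps 4 and 5 are repeated for every cv iteration. The scores are averaged across the cv iterations, and that average is assigned to the parameter combination. It can be accessed through the cv_results_ attribute of the grid search.
7. For the parameters that give the best score, the internal estimator is re-initialized with those parameters and refit on the whole data supplied to it (features and labels).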
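The procedure described above can be sketched by hand. This is only an illustration of the idea, not scikit-learn's actual implementation: the synthetic data, the pipeline, and the parameter grid are stand-ins invented for the example, since the question does not show `pipe` or `clf_params`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Enron dataset).
features, labels = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline and grid standing in for `pipe` and `clf_params`.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
clf_params = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Expand the grid into every parameter combination.
candidates = list(ParameterGrid(clf_params))

mean_scores = []
for params in candidates:
    fold_scores = []
    # Split the data with cv, train on the train part, score on the test part.
    for train_ind, test_ind in cv.split(features, labels):
        est = clone(pipe).set_params(**params)
        est.fit(features[train_ind], labels[train_ind])
        fold_scores.append(est.score(features[test_ind], labels[test_ind]))
    # Average the scores over the cv iterations for this combination.
    mean_scores.append(np.mean(fold_scores))

# Re-initialize the estimator with the best parameters and refit on ALL the data.
best_params = candidates[int(np.argmax(mean_scores))]
best_estimator = clone(pipe).set_params(**best_params)
best_estimator.fit(features, labels)
```

The last line is the important one for this question: after the search, the winning estimator has been trained on every row, including all rows that any split later hands back as "test" data.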

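Because of that final refit on the whole data, you get different scores in the first and second approaches. In the first approach, all of the data was used for training, and you are predicting on that same data. The second approach predicts on data the estimator has not seen before.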
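A small experiment can make the difference visible. The sketch below uses synthetic, deliberately noisy data and an SVC grid that are assumptions for illustration, not the asker's setup. It scores the same test rows twice: once with the estimator refit on everything, and once with a fresh copy trained only on the training part of each split.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic noisy data (assumption: stands in for the real features/labels).
features, labels = make_classification(
    n_samples=200, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 100], "gamma": [0.1, 1]}, cv=cv)
gcv.fit(features, labels)  # best_estimator_ is refit on the full dataset

seen_scores, unseen_scores = [], []
for train_ind, test_ind in cv.split(features, labels):
    # First approach: the test rows were part of the refit data.
    seen_scores.append(
        gcv.best_estimator_.score(features[test_ind], labels[test_ind])
    )
    # Second approach: a fresh copy that only ever sees the training rows.
    est = clone(gcv.best_estimator_).fit(features[train_ind], labels[train_ind])
    unseen_scores.append(est.score(features[test_ind], labels[test_ind]))

print("scored on rows seen during refit:", np.mean(seen_scores))
print("scored on held-out rows:         ", np.mean(unseen_scores))
print("best_score_ from the search:     ", gcv.best_score_)
```

With the same splitter and a deterministic estimator, the second number should simply reproduce `gcv.best_score_`, because it repeats what the search already did for the winning parameters. The first number is a training-set score and will typically look better than it should.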

Yes, step 7 is performed when `refit=True`. By default, `refit=True` in GridSearchCV(). You also did not specify the `refit` parameter in your code, which is why I did not use it. (2 upvotes)
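The effect of `refit` can be checked directly. In this minimal sketch (synthetic data and a one-parameter grid, both assumptions for illustration), the search run with the default `refit=True` exposes `best_estimator_`, while the one run with `refit=False` only reports the best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=100, random_state=0)
param_grid = {"C": [0.1, 1, 10]}

# Default behaviour: refit=True, so the best estimator is retrained on all of X, y.
with_refit = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# refit=False: the search still ranks the candidates, but no final model is trained.
no_refit = GridSearchCV(SVC(), param_grid, cv=5, refit=False).fit(X, y)

has_model_with_refit = hasattr(with_refit, "best_estimator_")
has_model_without_refit = hasattr(no_refit, "best_estimator_")

print(has_model_with_refit, has_model_without_refit)
print(no_refit.best_params_)
```

If only the best parameters are wanted, `refit=False` avoids the full-data refit, and a new estimator can then be built from `best_params_` and trained on whichever subset is appropriate.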

Answer 2

Gau*_*len 3

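Basically, the grid search will:

Try every combination of your parameter grid
For each of them, do a K-fold cross-validation
Select the best one available.

So your second case is the good one. Otherwise, you are actually predicting on data you trained on (which is not the case in the second option, where you only keep the best parameters from the grid search).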
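One common way to avoid predicting on training data altogether is to set aside a test set before running the search. This is a sketch of that alternative, not something either answer prescribes; the synthetic imbalanced data and the small grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedShuffleSplit,
    train_test_split,
)
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption: stands in for the real features/labels).
features, labels = make_classification(
    n_samples=300, weights=[0.8, 0.2], random_state=0
)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv)
gcv.fit(X_train, y_train)  # the refit now uses the training portion only

# gcv.predict delegates to best_estimator_, which has never seen X_test.
pred = gcv.predict(X_test)
precision = precision_score(y_test, pred, zero_division=0)
recall = recall_score(y_test, pred, zero_division=0)
print("precision:", precision, "recall:", recall)
```

Here the refit step is harmless, because the data it refits on and the data used for the final precision and recall are disjoint.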

Archived:	8 years, 9 months ago
Views:	2425
Last activity:	7 years, 3 months ago