Using GroupKFold in nested cross-validation with sklearn

Question

Sör*_*ler 5 python scikit-learn cross-validation

My code is based on the example from the sklearn website: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

I am trying to use GroupKFold in both the inner and the outer CV.

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, GroupKFold
import numpy as np

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris)

I know that passing the y values as the groups argument is not what it is meant for! With this code I get the following error.

.../anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: The 'groups' parameter should not be None.

Does anyone know how to solve this?

Thanks in advance for your help,

Sören

Answer 1

Gio*_*ano 6

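I ran into a similar problem and found @Samalama's solution to be a good one. The only thing I had to change was the call to fit: I also had to slice groups so that it has the same shape as X and y for the training set. Otherwise I get an error saying that the three objects do not have the same shape. Is this a correct implementation?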

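from sklearn.model_selection import RandomizedSearchCV

# x, y and groups are assumed to be NumPy arrays of equal length; model,
# parameters_grid, get_scoring(), jobs, inner_cv and outer_cv are defined elsewhere.
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # Slice groups with the same indices as x and y so all three line up.
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)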
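
For readers who want something they can run end to end, here is a minimal sketch of the same manual outer loop applied to the iris data from the question. The error in the question suggests that the `groups` given to `cross_val_score` reach only the outer splitter, so the inner search is fitted without them; the loop below passes the sliced groups to the inner search explicitly. The `groups` array is an assumption made purely for illustration (iris has no natural grouping), and `GridSearchCV` stands in for the `RandomizedSearchCV` used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grouping, for illustration only: 15 groups of 10 samples,
# assigned round-robin so that every group contains all three classes.
groups = np.arange(len(y)) % 15

p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=groups):
    clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    # Pass the sliced groups so the inner GroupKFold can split the training fold.
    clf.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # Score the refitted best estimator on the held-out outer fold.
    nested_scores.append(clf.score(X[test_idx], y[test_idx]))

nested_scores = np.array(nested_scores)
print(nested_scores.mean())
```

With real data, `groups` would hold whatever identifier must not be shared between training and test folds (for example a subject or session ID).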

Answer 2

ywb*_*aek 0

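As you can see from the documentation of GroupKFold,
you use it when you want K-fold with non-overlapping groups.
This means that unless you have distinct groups of data that must be kept separate when creating the K folds, you should not use this method.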

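That being said, for the given example you have to create groups manually;
it should be an array-like object with the same shape as your y.
And

the number of distinct groups must be at least equal to the number of folds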

Here is the example code from the documentation:

import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)

You can see that groups has the same shape as y,
and that it has two distinct groups, 0 and 2, which matches the number of folds.
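
To see what "non-overlapping groups" means in practice, the following small sketch reuses the arrays from the documentation example and iterates over the splits. The loop and the printing are additions for illustration, not part of the documentation snippet.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

group_kfold = GroupKFold(n_splits=2)
splits = list(group_kfold.split(X, y, groups))

for train_idx, test_idx in splits:
    # The group labels on the two sides of a split never intersect.
    print("train groups:", set(groups[train_idx]),
          "test groups:", set(groups[test_idx]))
```

Each fold holds out one whole group, so no group ever appears in both the training and the test indices of the same split.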

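Edit:
The get_n_splits(groups) method of the GroupKFold object returns the number of splitting iterations in the cross-validator, which we can pass as the cv keyword argument to the cross_val_score function.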

clf = GridSearchCV(estimator=svm, 
                   param_grid=p_grid, 
                   cv=inner_cv.get_n_splits(groups=y_iris))

nested_score = cross_val_score(clf, X=X_iris, y=y_iris, 
                               cv=outer_cv.get_n_splits(groups=y_iris))
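
One caveat is worth checking before relying on this: `get_n_splits` returns a plain integer, and when `cv` is an integer scikit-learn builds its own `KFold` or `StratifiedKFold` splitter from it. The snippet above therefore avoids the error, but as far as I can tell the folds are no longer split by group. The sketch below illustrates this with `check_cv`, the helper scikit-learn uses to resolve `cv` arguments; the variable names are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold, check_cv

X_iris, y_iris = load_iris(return_X_y=True)

# get_n_splits only reports the configured number of folds.
n_splits = GroupKFold(n_splits=3).get_n_splits(groups=y_iris)
print(n_splits)

# Resolve the integer the same way scikit-learn does for a classifier.
resolved = check_cv(n_splits, y_iris, classifier=True)
print(type(resolved).__name__)
```

If grouped splitting is actually needed, passing the `GroupKFold` object itself as `cv` together with a real `groups` array keeps the grouping in effect.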

Archived:	5 years, 7 months ago
Views:	7580
Last activity:	2 years, 11 months ago