StratifiedKFold vs KFold in scikit-learn

python machine-learning scikit-learn

I used this code to compare KFold and StratifiedKFold.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74]
])

y = np.array([0,0,0,0,1,1,1,1])

# random_state only has an effect when shuffle=True;
# recent scikit-learn versions raise an error if it is set with shuffle=False
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
kfolder = KFold(n_splits=4, shuffle=False)

# StratifiedKFold keeps the class proportions of y in every fold
for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("StratifiedKFold done")

# KFold slices the indices into consecutive folds and ignores y
for train, test in kfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("KFold done")

I found that StratifiedKFold keeps the label proportions in each fold, but KFold does not.

Train: [1 2 3 5 6 7] | test: [0 4]
Train: [0 2 3 4 6 7] | test: [1 5]
Train: [0 1 3 4 5 7] | test: [2 6]
Train: [0 1 2 4 5 6] | test: [3 7]
StratifiedKFold done
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
KFold done

It seems that StratifiedKFold is better, so should KFold never be used?

When should KFold be used instead of StratifiedKFold?

Jay*_*hai 13

I think you should ask "When to use StratifiedKFold instead of KFold?".

You first need to know what "KFold" and "Stratified" mean.

KFold is a cross-validator that divides the dataset into k folds.

"Stratified" means ensuring that each fold of the dataset has the same proportion of observations for each label as the full dataset.

So StratifiedKFold is essentially an improved version of KFold for classification targets.

Therefore, the answer to this question is that we should prefer StratifiedKFold over KFold when dealing with classification tasks, especially ones with imbalanced class distributions.


FOR EXAMPLE

Suppose there is a dataset with 16 data points and an imbalanced class distribution: 12 data points belong to class A and the rest (i.e. 4) belong to class B, so the ratio of class B to class A is 1/3. If we use StratifiedKFold with k = 4, then in every split the training set will include 9 data points from class A and 3 data points from class B, and the test set will include 3 data points from class A and 1 data point from class B, as the sketch below confirms.
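These counts are easy to verify. Below is a minimal sketch: the class sizes (12 vs 4) and k = 4 come from the example above, while the feature values are arbitrary placeholders.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 16 samples: 12 of class A (label 0) and 4 of class B (label 1)
y = np.array([0] * 12 + [1] * 4)
X = np.arange(16).reshape(-1, 1)   # feature values do not affect the split

for name, cv in [("StratifiedKFold", StratifiedKFold(n_splits=4)),
                 ("KFold", KFold(n_splits=4))]:
    print(name)
    for train, test in cv.split(X, y):
        # count how many samples of each class land in the test fold
        print("  test fold class counts:", np.bincount(y[test], minlength=2))

With StratifiedKFold every test fold contains 3 class-A samples and 1 class-B sample, whereas plain KFold (with these sorted labels) puts all four class-B samples into the last fold.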

As we can see, StratifiedKFold preserves the class distribution of the dataset across the splits, while KFold does not take it into account.
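In practice you rarely iterate over the folds by hand; more often the splitter is passed as the cv argument of cross_val_score. Here is a minimal sketch of that usage on an imbalanced dataset; the dataset generated with make_classification, the LogisticRegression model, and the chosen parameters are all illustrative, not taken from the question.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000)

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(name, scores.round(3))

With plain KFold the number of minority-class samples per fold fluctuates (and can even be zero), so the per-fold scores are noisier; StratifiedKFold gives each fold the same class ratio as the whole dataset. Note also that when you pass a plain integer as cv for a classifier, scikit-learn already uses stratified folds under the hood.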