ValueError: n_splits=10 cannot be greater than the number of members in each class

SFC*_*SFC 3 python scikit-learn cross-validation

I am trying to run the following code:

from sklearn.model_selection import StratifiedKFold 
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]

skf = StratifiedKFold(n_splits=10)

for train, test in skf.split(X,y):  
    print("%s %s" % (train,test))

But I get the following error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

I had a look at "scikit-learn error: The least populated class in y has only 1 member", but I'm still not sure what is wrong with my code.

Both of my lists have length 14, as confirmed by print(len(X)) and print(len(y)).

Part of my confusion is that I'm not sure what "members" and "class" mean in this context.

Questions: How do I fix the error? What is a member? What is a class? (in this context)

Viv*_*mar 9

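Stratification means keeping the ratio of each class the same in every fold. So if your original dataset has 3 classes in the ratio 60%, 20% and 20%, stratification will try to keep that ratio in each fold.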

In your case,

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today", 
     "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

You have 14 samples (members) in total, distributed as follows:

class    number of members    percentage
 'n'            9                64.3
 'r'            3                21.4
 'y'            2                14.3
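The class counts above can be reproduced with a few lines of standard-library Python. This is only a sketch for checking the numbers: it uses the y list from this answer, and the variable names (counts, total, min_class_count) are illustrative.

```python
from collections import Counter

# Labels as used in this answer (classes 'n', 'r' and 'y').
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

counts = Counter(y)   # number of members (samples) per class
total = len(y)

# Print each class with its member count and share of the dataset.
for label, count in counts.most_common():
    print(f"{label!r}: {count} members, {100 * count / total:.1f}%")

# Size of the least populated class.
min_class_count = min(counts.values())
print("smallest class has", min_class_count, "members")
```

Here a "class" is a distinct label value in y, and a "member" is one sample carrying that label.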

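So StratifiedKFold will try to keep this ratio in each fold. Now, you have specified 10 folds (n_splits). This means that, to keep the ratio, a single fold would need 2/10 = 0.2 members of class 'y'. But a fold cannot hold less than 1 member (sample), which is why the error is thrown.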
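The arithmetic in the previous paragraph can be spelled out for every class. The snippet below is a rough illustration of that reasoning, not scikit-learn's actual check; members_per_fold is a hypothetical helper, and the counts are taken from the table above.

```python
# Class counts from the table in this answer.
class_counts = {"n": 9, "r": 3, "y": 2}


def members_per_fold(counts, n_splits):
    """Return how many members of each class an even split would put in one fold."""
    return {label: count / n_splits for label, count in counts.items()}


for n_splits in (10, 2):
    share = members_per_fold(class_counts, n_splits)
    # Classes that cannot contribute even one whole sample to every fold.
    short = [label for label, s in share.items() if s < 1]
    print(f"n_splits={n_splits}: {share} -> classes below 1 per fold: {short}")
```

With n_splits=10 every class falls below one member per fold, while with n_splits=2 none does.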

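If you set n_splits=2 instead of n_splits=10, it will work, because each fold then gets 2/2 = 1 member of class 'y'. For n_splits=10 to work properly, you need at least 10 samples in each class.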
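One possible way to apply this is to derive n_splits from the size of the smallest class rather than hard-coding it. The sketch below assumes that every class should appear in every test fold and uses the X and y from this answer; choosing n_splits this way is an illustration, not something scikit-learn does for you.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Use at most as many folds as the least populated class has members
# (here class 'y' has 2 members, so n_splits becomes 2).
n_splits = min(Counter(y).values())

skf = StratifiedKFold(n_splits=n_splits)
folds = list(skf.split(X, y))

for train, test in folds:
    print("%s %s" % (train, test))
```

If more folds are needed, the alternatives are to collect more samples for the rare classes or to merge or drop those classes before splitting.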