The "stratify" parameter of "train_test_split" (scikit-learn)

Question

Dan*_*vaw 67 test-data split training-data scikit-learn

I am trying to use train_test_split from scikit-learn, but I am having trouble with the stratify parameter. Here is the code:

from sklearn import cross_validation, datasets 

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

cross_validation.train_test_split(X,y,stratify=y)

However, I keep getting the following error:

raise TypeError("Invalid parameters passed: %s" % str(options))
TypeError: Invalid parameters passed: {'stratify': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

Does anyone know what is going on? Below is the function documentation.

[...]

stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the labels array.

New in version 0.17: stratify splitting

[...]

Answer 1

Faz*_*ini 213

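The stratify parameter makes the split so that the proportion of values in each resulting sample matches the proportion of values in the array passed to stratify.

For example, if y is a binary categorical variable with values 0 and 1, and 25% of its values are zeros and 75% are ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's.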
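
A minimal sketch of the 25% / 75% case described above. The arrays X and y are made up for illustration and are not from the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 25 zeros and 75 ones (25% / 75%).
y = np.array([0] * 25 + [1] * 75)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Since the labels are 0/1, the mean is the fraction of ones in each subset.
print("fraction of ones in train:", y_train.mean())
print("fraction of ones in test: ", y_test.mean())
```

Both fractions should come out at 0.75, because the class counts here divide evenly between the two subsets. With counts that do not divide evenly, the proportions are matched as closely as whole samples allow.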

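This doesn't really answer the question, but it is very useful for understanding how it works. Thanks a lot. (64 upvotes)
@HolgerBrandl It will be preserved on average; with stratification, it will be preserved for sure. (7 upvotes)
@HolgerBrandl With very small datasets or very imbalanced datasets, a random split may well eliminate a class entirely from one of the splits. (4 upvotes)
@HolgerBrandl Good question! Perhaps we can add, first, that you have to split into training and test sets using `stratify`. Second, to correct the imbalance, you eventually need to oversample or undersample on the training set. Many sklearn classifiers have a parameter called class_weight that you can set to balanced. Finally, for imbalanced datasets you can also use metrics that are more appropriate than accuracy. Try F1 or the area under the ROC curve. (3 upvotes)
I still have trouble understanding why this stratification is needed: if there is class imbalance in the data, won't it be preserved on average when the data is split randomly? (2 upvotes)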
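
The suggestions in the comments above (stratified split, class weighting, metrics other than accuracy) might be combined roughly like this. The synthetic dataset from make_classification and the choice of LogisticRegression are assumptions for illustration only; oversampling or undersampling is not shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% class 0, 10% class 1); illustrative only.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Stratify so both subsets keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate with F1 and ROC AUC instead of plain accuracy.
f1 = f1_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"F1: {f1:.3f}, ROC AUC: {auc:.3f}")
```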

Answer 2

Bor*_*rja 47

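Scikit-Learn is just telling you that it doesn't recognise the argument "stratify", not that you are using it incorrectly. This is because the parameter was added in version 0.17, as indicated in the documentation you quoted.

So you just need to update Scikit-Learn.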
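
A quick way to check which version is installed, as a sketch. The simple major.minor parsing assumes an ordinary release version string such as "0.17.1":

```python
import sklearn

# Take the first two components of the version string, e.g. "0.17.1" -> (0, 17).
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])

# stratify was added to train_test_split in version 0.17.
has_stratify = (major, minor) >= (0, 17)
print(sklearn.__version__, "supports stratify:", has_stratify)
```

If this reports an older version and the environment is managed with pip, pip install -U scikit-learn is the usual way to upgrade.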

Answer 3

Mar*_*oma 40

For my future self who comes here via Google:

train_test_split is now in model_selection, hence:

from sklearn.model_selection import train_test_split

# given:
# features: xs
# ground truth: ys

x_train, x_test, y_train, y_test = train_test_split(xs, ys,
                                                    test_size=0.33,
                                                    random_state=0,
                                                    stratify=ys)

is the way to use it. Setting random_state is desirable for reproducibility.
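
To see what reproducibility means here, the following sketch runs the same call twice with the same random_state and compares the results. The arrays xs and ys are made-up placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, two balanced classes.
xs = np.arange(40).reshape(20, 2)
ys = np.array([0] * 10 + [1] * 10)

# Two independent calls with identical arguments, including random_state.
split_a = train_test_split(xs, ys, test_size=0.33, random_state=0, stratify=ys)
split_b = train_test_split(xs, ys, test_size=0.33, random_state=0, stratify=ys)

# Compare x_train, x_test, y_train, y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Without a fixed random_state, each call shuffles differently, so results generally change from run to run.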

Answer 4

The*_*Guy 13

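The answer I can give is that stratification preserves the proportion in which the data is distributed in the target column, and reproduces that same proportion in the output of train_test_split. For example, if the problem is a binary classification problem and the target column has the proportions: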

80%=yes
20%=no

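Since 'yes' is 4 times more frequent than 'no' in the target column, splitting into train and test without stratifying can cause trouble: only the 'yes' values might fall into the training set while all the 'no' values fall into the test set (i.e., the training set might have no 'no' in its target column).

Hence, by stratifying on the target column:

the training set has 80% 'yes' and 20% 'no', and

the test set likewise has 80% 'yes' and 20% 'no'.

So stratify makes the target (label) distribution the same in the training and test sets, just as it is in the original dataset.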

from sklearn.model_selection import train_test_split

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=43)
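
To make the 80% / 20% argument concrete, here is a small sketch that counts how many 'no' labels land in the test set over many random splits, with and without stratify. The toy features and target arrays and the helper no_count_in_test are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 'yes' and 4 'no' labels (80% / 20%).
target = np.array(["yes"] * 16 + ["no"] * 4)
features = np.arange(len(target)).reshape(-1, 1)

def no_count_in_test(seed, stratified):
    """Return how many 'no' labels end up in the test set for one split."""
    _, _, _, y_test = train_test_split(
        features, target, test_size=0.25, random_state=seed,
        stratify=target if stratified else None)
    return int(np.sum(y_test == "no"))

seeds = range(200)
plain = [no_count_in_test(s, stratified=False) for s in seeds]
strat = [no_count_in_test(s, stratified=True) for s in seeds]

print("without stratify, 'no' counts seen in test:", sorted(set(plain)))
print("with stratify, 'no' counts seen in test:   ", sorted(set(strat)))
```

With stratify, the 5-sample test set should always hold exactly one 'no' (20%). Without it the count varies from split to split, and some splits leave the test set with no 'no' at all.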

Answer 5

X. *_*ang 11

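In this context, stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.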
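
For the iris data from the question, which has three classes of 50 samples each, this can be checked by counting labels in each subset. A sketch, assuming a scikit-learn version that provides sklearn.model_selection:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, :2], iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# np.bincount gives the number of samples per class label (0, 1, 2).
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train counts per class:", train_counts)
print("test counts per class: ", test_counts)
```

Because the input is perfectly balanced, each class should be equally represented in both subsets.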

Answer 6

Ser*_*nov 6

Try running this code, it "just works":

from sklearn import cross_validation, datasets 

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X,y,train_size=.8, stratify=y)

y_test

array([0, 0, 0, 0, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 2,
       1, 2, 1, 1, 0, 2, 1])

@user5767535 By the way, "New in version 0.17: stratify splitting" makes me almost sure that you have to update your `sklearn`... (3 upvotes)

Archived:	10 years, 1 month ago
Views:	78473
Last activity:	7 years, 3 months ago