Jea*_*nne 24 machine-learning svm scikit-learn cross-validation
I am using sklearn for a multi-class classification task. I need to split all my data into a train_set and a test_set, and I want to randomly draw the same number of samples from each class. Currently I am using this function:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives me an imbalanced dataset! Any suggestions?
Gui*_*sch 24
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify parameter.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The catch here is that this parameter only exists since version 0.17 of sklearn.
From the documentation on the stratify parameter:
stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting.
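A quick self-contained check (with made-up toy data) of what `stratify` buys you: both splits keep the original class ratio. Note that in modern scikit-learn, `train_test_split` lives in `sklearn.model_selection`, not the removed `sklearn.cross_validation` module:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# both splits preserve the original 80/20 class ratio
print(np.bincount(y_train))  # [56 14]
print(np.bincount(y_test))   # [24  6]
```

Without `stratify=y`, the per-split class counts would depend on the random shuffle and can drift noticeably on small or imbalanced data.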
Chr*_*sch 17
You can use StratifiedShuffleSplit to create splits with the same class percentages as the original dataset:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# with sklearn >= 0.18, the splitter takes n_splits and you call .split(X, y)
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]
    print(X_train)  # two samples, one from each class
    print(y_train)  # e.g. [1 0]
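On a larger imbalanced set (illustrative data), a quick check confirms that StratifiedShuffleSplit keeps the original class percentages in both partitions; this uses the current sklearn.model_selection API (n_splits plus a `.split(X, y)` call):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 75 + [1] * 25)   # 75% / 25% class split
X = np.arange(100).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

print(np.bincount(y[train_idx]))  # [60 20] -> still 75% / 25%
print(np.bincount(y[test_idx]))   # [15  5] -> still 75% / 25%
```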
ant*_*ike 10
If the classes are not balanced but you want the split itself to be balanced, stratifying won't help. There doesn't seem to be a built-in way to do balanced sampling in sklearn, but it is easy with basic numpy; for example, a function like this might help you:
import numpy as np

def split_balanced(data, target, test_size=0.2):
    classes = np.unique(target)
    # can give test_size as fraction of input data size or as number of samples
    if test_size < 1:
        n_test = int(np.round(len(target) * test_size))
    else:
        n_test = test_size
    n_train = max(0, len(target) - n_test)
    n_train_per_class = max(1, int(np.floor(n_train / len(classes))))
    n_test_per_class = max(1, int(np.floor(n_test / len(classes))))

    ixs = []
    for cl in classes:
        if (n_train_per_class + n_test_per_class) > np.sum(target == cl):
            # if data has too few samples for this class, do upsampling
            # split the data into training and testing before sampling so data
            # points won't be shared among training and test data
            splitix = int(np.ceil(n_train_per_class / (n_train_per_class + n_test_per_class)
                                  * np.sum(target == cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target == cl)[0][:splitix], n_train_per_class),
                             np.random.choice(np.nonzero(target == cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target == cl)[0], n_train_per_class + n_test_per_class,
                                        replace=False))

    # take the same number of samples from each class
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class + n_test_per_class)] for x in ixs])
    X_train = data[ix_train, :]
    X_test = data[ix_test, :]
    y_train = target[ix_train]
    y_test = target[ix_test]
    return X_train, X_test, y_train, y_test
Note that if you use this and sample more points per class than there are in the input data, those points will be upsampled (sampled with replacement). As a result, some data points will appear multiple times, which may affect accuracy measures etc. And if some class has only one data point, there will be an error. You can easily check the number of points per class, for example with np.unique(target, return_counts=True).
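The per-class count check mentioned above pairs naturally with the balanced-sampling idea. Here is a self-contained sketch of both on made-up toy data (the array names and sizes are illustrative):

```python
import numpy as np

np.random.seed(0)
target = np.array([0] * 90 + [1] * 10)   # heavily imbalanced toy labels
data = np.random.randn(100, 2)           # 100 samples, 2 features

# inspect per-class counts before sampling
classes, counts = np.unique(target, return_counts=True)
print(classes, counts)  # [0 1] [90 10]

# draw 10 samples per class; sample with replacement only if a class is too small
n_per_class = 10
ix = np.concatenate([
    np.random.choice(np.nonzero(target == cl)[0], n_per_class,
                     replace=counts[i] < n_per_class)
    for i, cl in enumerate(classes)
])
X_bal, y_bal = data[ix], target[ix]

# the sampled labels are now balanced
print(np.unique(y_bal, return_counts=True))  # (array([0, 1]), array([10, 10]))
```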