How do I get balanced class samples from an imbalanced dataset in sklearn?

Kri*_*yan 8 scikit-learn

I have a dataset with binary class labels. I want to extract samples with balanced classes from it. The code I wrote below gives me an imbalanced sample.

from scipy.stats import itemfreq
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(train_size=5000, n_splits=1, test_size=50000, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(itemfreq(y_train))

As you can see, class 0 has 2438 samples and class 1 has 2562.

[[  0.00000000e+00   2.43800000e+03]
 [  1.00000000e+00   2.56200000e+03]]

How should I proceed to get 2500 samples each of class 1 and class 0 in my training set? (And 25000 of each in the test set.)

Ton*_*has 6

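Since you haven't provided your dataset, I'm using mock data generated with make_blobs. It isn't clear from your question how many test samples there should be. I've defined test_samples = 50000, but you can change this value to fit your needs.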

from sklearn import datasets

train_samples = 5000
test_samples = 50000
total_samples = train_samples + test_samples
X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)

The following snippet splits the data into train and test sets with balanced classes:

from sklearn.model_selection import StratifiedShuffleSplit    

sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1, 
                             test_size=test_samples, random_state=0)  

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
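
For a single split, the same stratified behaviour is also available through train_test_split with the stratify argument. The following is a minimal sketch, not part of the original answer; it uses a smaller mock dataset (5500 samples, an assumption made only to keep the example quick) generated the same way with make_blobs.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Smaller mock dataset than in the answer; make_blobs yields two
# equally sized classes here (2750 samples each).
X, y = make_blobs(n_samples=5500, centers=2, random_state=0)

# stratify=y keeps the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=500, test_size=5000, stratify=y, random_state=0
)

train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()
print(train_counts, test_counts)
```

Because the mock labels are balanced, both subsets should come out balanced as well; with imbalanced labels the subsets would mirror the imbalance instead.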

Demo

In [54]: from scipy import stats

In [55]: stats.itemfreq(y_train)
Out[55]: 
array([[   0, 2500],
       [   1, 2500]], dtype=int64)

In [56]: stats.itemfreq(y_test)
Out[56]: 
array([[    0, 25000],
       [    1, 25000]], dtype=int64)
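
The per-class counts can also be computed with np.unique(..., return_counts=True), which is the usual replacement for stats.itemfreq (deprecated in recent SciPy releases). The sketch below applies it to hypothetical 9:1 labels, chosen only for illustration, and suggests that a stratified split reproduces the class proportions of y rather than equalising them.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 900 samples of class 0, 100 of class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((y.size, 1))  # placeholder features; only y drives stratification

sss = StratifiedShuffleSplit(n_splits=1, train_size=200, test_size=800,
                             random_state=0)
train_index, test_index = next(sss.split(X, y))

# Count samples per class in the training subset.
labels, counts = np.unique(y[train_index], return_counts=True)
train_counts = dict(zip(labels.tolist(), counts.tolist()))
print(train_counts)
```

The training subset keeps the 9:1 ratio of the full label vector, so it is stratified but not balanced.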

EDIT

As @geompalik correctly pointed out, if your dataset is imbalanced, StratifiedShuffleSplit won't yield balanced splits. In that case you might find this function useful:


import numpy as np


def stratified_split(y, train_ratio):
    """Return train/test indices that keep the class proportions of y."""

    def split_class(y, label, train_ratio):
        # Indices of all samples with this label, in their original order
        indices = np.flatnonzero(y == label)
        n_train = int(indices.size * train_ratio)
        # No shuffling: the first n_train indices go to train, the rest to test
        train_index = indices[:n_train]
        test_index = indices[n_train:]
        return (train_index, test_index)

    idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
    train_index = np.concatenate([train for train, _ in idx])
    test_index = np.concatenate([test for _, test in idx])
    return train_index, test_index

Demo

I have generated mock data beforehand with the number of samples per class that you specified (code not shown here).

In [153]: y
Out[153]: array([1, 0, 1, ..., 0, 0, 1])

In [154]: y.size
Out[154]: 55000

In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)  

In [156]: train_ratio
Out[156]: 0.09090909090909091

In [157]: train_index, test_index = stratified_split(y, train_ratio)

In [158]: y_train = y[train_index]

In [159]: y_test = y[test_index]

In [160]: y_train.size
Out[160]: 5000

In [161]: y_test.size
Out[161]: 50000

In [162]: stats.itemfreq(y_train)
Out[162]: 
array([[   0, 2438],
       [   1, 2562]], dtype=int64)

In [163]: stats.itemfreq(y_test)
Out[163]: 
array([[    0, 24380],
       [    1, 25620]], dtype=int64)
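
Note that stratified_split preserves the class proportions of y, so the counts above mirror the original 2438:2562 ratio rather than being equal. If an exact, equal number of samples per class is needed, one option is to draw a fixed number of indices from each class. The following is a minimal sketch under that assumption; balanced_split and its parameters are hypothetical names, not part of scikit-learn or of the original answer, and the tiny label vector is for illustration only.

```python
import numpy as np


def balanced_split(y, n_train_per_class, n_test_per_class, seed=0):
    """Return (train_index, test_index) with a fixed sample count per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n_needed = n_train_per_class + n_test_per_class
    train_parts, test_parts = [], []
    for label in np.unique(y):
        indices = np.flatnonzero(y == label)
        if indices.size < n_needed:
            raise ValueError(
                f"class {label!r} has {indices.size} samples, need {n_needed}"
            )
        # Draw without replacement so train and test never share a sample.
        chosen = rng.choice(indices, size=n_needed, replace=False)
        train_parts.append(chosen[:n_train_per_class])
        test_parts.append(chosen[n_train_per_class:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


# Hypothetical imbalanced labels: 70 of class 0, 30 of class 1.
y = np.array([0] * 70 + [1] * 30)
train_index, test_index = balanced_split(y, n_train_per_class=5,
                                         n_test_per_class=20)
train_counts = np.bincount(y[train_index]).tolist()
test_counts = np.bincount(y[test_index]).tolist()
print(train_counts, test_counts)
```

This discards the surplus samples of the larger class, and it raises a ValueError when a class is too small to supply the requested counts, so 2500 + 25000 samples per class are only achievable if every class has at least 27500 samples.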