How to get a non-shuffled train_test_split in sklearn

max*_*moo 13 python scikit-learn

If I want a random train/test split, I use the sklearn helper function:

In [1]: from sklearn.model_selection import train_test_split
   ...: train_test_split([1,2,3,4,5,6])
   ...:
Out[1]: [[1, 6, 4, 2], [5, 3]]

What is the most concise way to get a non-shuffled train/test split, i.e.

[[1,2,3,4], [5,6]]

Edit: currently I am using

train, test = data[:int(len(data) * 0.75)], data[int(len(data) * 0.75):] 

but I was hoping for something nicer. I opened an issue on sklearn: https://github.com/scikit-learn/scikit-learn/issues/8844
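For anyone who wants the manual slice without repeating the index arithmetic, here is a minimal sketch in plain Python. The name `ordered_split` and the `train_fraction` parameter are made up for illustration and are not part of sklearn.

```python
def ordered_split(data, train_fraction=0.75):
    """Split a sequence into a leading and a trailing part, keeping order."""
    # Compute the cut index once so both slices always agree on it.
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


train, test = ordered_split([1, 2, 3, 4, 5, 6])
print(train, test)  # [1, 2, 3, 4] [5, 6]
```

Because `int()` truncates, the training part is rounded down here; change the rounding if you prefer the test part to be the smaller one.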

Edit 2: my PR has been merged. As of scikit-learn version 0.19, you can pass the parameter shuffle=False to train_test_split to get a non-shuffled split.

小智 9

All you need to do is set the shuffle parameter to False and the stratify parameter to None:

    In [49]: train_test_split([1,2,3,4,5,6],shuffle = False, stratify = None)
    Out[49]: [[1, 2, 3, 4], [5, 6]]
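A slightly fuller sketch of the same call, assuming scikit-learn 0.19 or later and NumPy are installed. The toy arrays `X` and `y` are invented for illustration. As far as I can tell, recent versions raise a ValueError if shuffle=False is combined with a non-None stratify, which is why stratify has to stay None.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 6 samples with 2 features each, labels 0..5 (illustrative only).
X = np.arange(12).reshape(6, 2)
y = np.arange(6)

# shuffle=False keeps the original order: leading rows go to train, trailing rows to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print(y_train, y_test)  # [0 1 2 3] [4 5]

# Stratifying needs shuffling, so this combination is expected to be rejected.
stratify_error = None
try:
    train_test_split(X, y, test_size=0.25, shuffle=False, stratify=y)
except ValueError as err:
    stratify_error = str(err)
print(stratify_error)
```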


Ana*_*ake 8

I don't have much to add to Psidom's answer beyond an easy copy-paste function:

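import numpy as np

def non_shuffling_train_test_split(X, y, test_size=0.2):
    # Cut index: the first i rows go to train, the rest to test.
    # The + 1 rounds up when the product is fractional, but it also adds
    # one extra training row when the product is already a whole number.
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test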

Update: at some point this became built in, so now you can do:

from sklearn.model_selection import train_test_split
train_test_split(X, y, test_size=0.2, shuffle=False)


Psi*_*dom 5

Use numpy.split:

import numpy as np
data = np.array([1,2,3,4,5,6])

np.split(data, [4])           # modify the index here to specify where to split the array
# [array([1, 2, 3, 4]), array([5, 6])]

If you want to split by percentage, you can compute the split index from the shape of the data:

data = np.array([1,2,3,4,5,6])
p = 0.6

idx = int(p * data.shape[0]) + 1      # p * n is usually fractional, so int(...) + 1 rounds it up;
                                      # adjust the rounding as you need, it rarely matters
                                      # when the data is large
np.split(data, [idx])
# [array([1, 2, 3, 4]), array([5, 6])]
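One caveat about the `+ 1` used above: it only behaves like rounding up when the product is fractional. A small sketch of an alternative using `math.ceil` follows. The helper name `split_index` is hypothetical, and whether you round up or down is a design choice rather than anything numpy prescribes.

```python
import math


def split_index(n_samples, train_fraction, round_up=True):
    """Return the index at which to cut an ordered sequence of n_samples items."""
    exact = train_fraction * n_samples
    # ceil/floor leave whole numbers unchanged, unlike int(exact) + 1.
    return math.ceil(exact) if round_up else math.floor(exact)


print(split_index(6, 0.6))   # 4, same as int(0.6 * 6) + 1
print(split_index(6, 0.5))   # 3, whereas int(0.5 * 6) + 1 gives 4
```

The result can be passed straight to `np.split(data, [idx])` or used as a slice bound.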