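I'm a relatively new sklearn user and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas DataFrame that I want to split into a training set and a test set. I want to stratify the data by at least 2, and ideally 4, columns of my DataFrame.
sklearn raised no warning when I tried this, but I later found duplicated rows in my final dataset. I created a sample test to show this behavior: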
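import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 'a' is a unique row id; 'b' and 'c' are the columns to stratify on
a = np.array([i for i in range(1000000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})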
If I stratify by either column alone, it seems to work as expected:
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000
But when I try to stratify by two columns, I get duplicate values:
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 640000
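One workaround that may be worth trying is sketched below. It assumes the duplicates come from how the installed sklearn version handles a multi-column `stratify` argument. The idea is to collapse the stratification columns into a single one-dimensional key and stratify on that instead. The helper name `split_on_combined_key` and the smaller 1000-row frame are illustrative choices, not part of the original test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_on_combined_key(df, cols, test_size=0.2, random_state=0):
    """Hypothetical helper: stratify on one string key built from `cols`."""
    # Join the chosen columns into a single label per row, e.g. "3_3".
    key = df[cols[0]].astype(str)
    for col in cols[1:]:
        key = key + "_" + df[col].astype(str)
    # Passing a 1-D key avoids handing a 2-D array to `stratify`.
    return train_test_split(
        df, test_size=test_size, random_state=random_state, stratify=key
    )


# Smaller version of the frame from the question (assumed size: 1000 rows).
a = np.arange(1000)
df = pd.DataFrame({'a': a, 'b': a % 10, 'c': a % 5})

train, test = split_on_combined_key(df, ['b', 'c'])

# Sanity checks: no row should appear twice or in both sets.
print(len(train), train['a'].nunique())
print(len(set(train['a']) & set(test['a'])))
```

If the row count and the number of unique `a` values in `train` match, and the overlap with `test` is empty, the combined-key split has not produced the duplicates seen above. The same helper should extend to four columns by passing a longer `cols` list, provided every combination of values occurs at least twice.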