Split a pandas DataFrame into 4 parts using stratified sampling

Sid*_*war 5 dataframe python-3.x pandas

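I want to split a DataFrame into 4 parts using stratified sampling, so that every category in column 'B' appears in each chunk. If a category does not have enough records to cover all chunks, the same records should be copied into the remaining chunks.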

import numpy as np
import pandas as pd

# 17 rows; category 'four' in column 'B' occurs only once
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17), 'D': np.random.randn(17)})

print(df)

      A      B         C         D
0   foo    one  0.960627  0.318723
1   bar    one  0.269439 -0.945565
2   foo    two  0.210376  0.765680
3   bar  three -0.375095 -1.617334
4   foo    two -1.910716 -0.532117
5   bar    two -0.277426  0.019717
6   foo    one -0.260074  1.384464
7   foo  three  0.072119 -1.077725
8   foo    one  0.093446 -0.683513
9   bar    one -0.154885 -1.453996
10  foo    two -1.258207  1.406615
11  bar  three -0.003332 -0.083092
12  foo    two  1.250562  0.519337
13  bar    two -0.837681 -1.465363
14  foo    one -0.403992 -0.133496
15  foo  three -0.757623 -0.459532
16  bar   four -2.071840  0.802953

The output should look like the following (every category in column 'B' should appear in each chunk; the index does not matter):

     A      B         C         D
0   foo    one  0.200466 -0.394136
2   foo    two  0.086008 -0.528286
3   bar  three -1.979613 -1.345405
8   foo    one -1.195563 -0.832880
15  foo  three -0.737060 -0.437047
16  bar   four -2.071840  0.802953

     A      B         C         D
1   bar    one  1.177119  0.693766
4   foo    two  0.452803 -0.595433
7   foo  three  1.285687  1.107021
12  foo    two  1.746976  1.449390
16  bar   four -2.071840  0.802953

     A      B         C         D
6   foo    one -0.095485  0.129541
5   bar    two  0.803417 -0.219461
7   foo  three  1.285687  1.107021
13  bar    two  1.166246 -1.711505
16  bar   four -2.071840  0.802953

     A      B         C         D
9   bar    one  2.001238 -0.283411
10  foo    two  0.865580  0.052533
11  bar  three -0.437604 -0.652073
14  foo    one -0.655985 -0.942792
16  bar   four -2.071840  0.802953

max*_*mus 1

This might help: df1, df2, df3, df4 = np.array_split(x_train, 4). Taken from: "Split a large dataframe into smaller equal dataframes".
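
Note that np.array_split applied to the whole frame cuts it by row position only, so by itself it does not guarantee that every category of 'B' lands in every chunk. A minimal sketch of one way to get the behaviour asked for in the question is to split each category separately and repeat rows for categories that are too small. The function name stratified_chunks and its parameters (col, n_chunks, seed) are hypothetical names chosen for this illustration, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def stratified_chunks(df, col, n_chunks, seed=0):
    """Hypothetical helper: split df into n_chunks frames so that every
    value of `col` appears in each frame, repeating rows if needed."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_chunks)]
    # .indices maps each category to the integer positions of its rows
    for positions in df.groupby(col).indices.values():
        pos = rng.permutation(positions)  # shuffle rows within the category
        if len(pos) < n_chunks:
            # Too few rows: repeat them cyclically until there is one per chunk
            pos = np.resize(pos, n_chunks)
        # Cut the category into n_chunks nearly equal, non-empty pieces
        for i, piece in enumerate(np.array_split(pos, n_chunks)):
            parts[i].extend(piece)
    # Sort positions so each chunk keeps the original row order
    return [df.iloc[sorted(p)] for p in parts]


# Same data as in the question
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17), 'D': np.random.randn(17)})

chunks = stratified_chunks(df, 'B', 4)
for chunk in chunks:
    print(chunk, end='\n\n')
```

With the question's data, the single 'four' row is copied into all four chunks, so the chunks together hold 20 rows rather than 17. Which rows of the larger categories end up in which chunk depends on the seed.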