Sid*_*war 5 dataframe python-3.x pandas
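I want to split a DataFrame into 4 parts using stratified sampling, making sure that every category in column 'B' appears in each chunk. If a category does not have enough records for all chunks, copy the same records into the remaining chunks.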
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17), 'D': np.random.randn(17)})
print(df)
      A      B          C          D
0   foo    one   0.960627   0.318723
1   bar    one   0.269439  -0.945565
2   foo    two   0.210376   0.765680
3   bar  three  -0.375095  -1.617334
4   foo    two  -1.910716  -0.532117
5   bar    two  -0.277426   0.019717
6   foo    one  -0.260074   1.384464
7   foo  three   0.072119  -1.077725
8   foo    one   0.093446  -0.683513
9   bar    one  -0.154885  -1.453996
10  foo    two  -1.258207   1.406615
11  bar  three  -0.003332  -0.083092
12  foo    two   1.250562   0.519337
13  bar    two  -0.837681  -1.465363
14  foo    one  -0.403992  -0.133496
15  foo  three  -0.757623  -0.459532
16  bar   four  -2.071840   0.802953
The output should look like the following (every category in column 'B' must appear in each chunk; the index values do not matter):
      A      B          C          D
0   foo    one   0.200466  -0.394136
2   foo    two   0.086008  -0.528286
3   bar  three  -1.979613  -1.345405
8   foo    one  -1.195563  -0.832880
15  foo  three  -0.737060  -0.437047
16  bar   four  -2.071840   0.802953

      A      B          C          D
1   bar    one   1.177119   0.693766
4   foo    two   0.452803  -0.595433
7   foo  three   1.285687   1.107021
12  foo    two   1.746976   1.449390
16  bar   four  -2.071840   0.802953

      A      B          C          D
6   foo    one  -0.095485   0.129541
5   bar    two   0.803417  -0.219461
7   foo  three   1.285687   1.107021
13  bar    two   1.166246  -1.711505
16  bar   four  -2.071840   0.802953

      A      B          C          D
9   bar    one   2.001238  -0.283411
10  foo    two   0.865580   0.052533
11  bar  three  -0.437604  -0.652073
14  foo    one  -0.655985  -0.942792
16  bar   four  -2.071840   0.802953
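
One way this could be approached, as a minimal sketch rather than a definitive answer: deal the rows of each 'B' category out to the chunks round-robin, and when a category has fewer rows than there are chunks, cycle back through its rows so the remaining chunks receive copies. The helper name `stratified_chunks` and its parameters (`by`, `n_chunks`, `seed`) are hypothetical, and the sketch assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd


def stratified_chunks(df, by, n_chunks=4, seed=None):
    """Hypothetical helper: split df into n_chunks parts so that every
    category of column `by` appears in each part (assumes a unique index)."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_chunks)]
    for _, group in df.groupby(by, sort=False):
        # Shuffle the index labels belonging to this category.
        labels = rng.permutation(group.index.to_numpy())
        # Use at least n_chunks slots; short categories wrap around,
        # which copies their rows into the remaining chunks.
        n_slots = max(len(labels), n_chunks)
        for slot in range(n_slots):
            parts[slot % n_chunks].append(labels[slot % len(labels)])
    return [df.loc[sorted(p)] for p in parts]


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17), 'D': np.random.randn(17)})

chunks = stratified_chunks(df, 'B', n_chunks=4, seed=0)
for chunk in chunks:
    print(chunk)
```

With this scheme the single 'four' row lands in all four chunks, and every original row is used at least once. Because each category starts dealing at the first chunk, the earlier chunks can end up slightly larger than the later ones; rotating the starting chunk per category would even that out if chunk size matters.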