Select a random sample of groups after groupby in pandas?

Question


I have a very large DataFrame that looks like this example df:

df = 

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
apple   pink     1.99 
apple   pink     1.99 
apple   pink     2.99 
...     ....      ...
pear    green     .99 
pear    green     .99 
pear    green    1.29


I group by 2 columns like this:

g = df.groupby(['col1', 'col2'])


Now I want to select 3 random groups, so my expected output looks like this:

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
pear    green     .99 
pear    green     .99 
pear    green    1.29
lemon   yellow    .99 
lemon   yellow    .99 
lemon   yellow   1.99


(Assume the three groups above were picked at random from df.) How can I do this? I tried this, but it does not help in my case.

Answer 1

WeN*_*Ben 8

You can use np.random.shuffle together with ngroup:

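import numpy as np

g = df.groupby(['col1', 'col2'])

# Shuffle the integer group ids 0 .. ngroups-1
a = np.arange(g.ngroups)
np.random.shuffle(a)

# Keep every row whose group id is among the first 2 shuffled ids
df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need (3 in the question)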

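To make it clearer why filtering on ngroup() keeps whole groups, here is a minimal sketch on a small made-up frame. The values are illustrative only and are not the asker's data; the group ids in the comments assume the default sort=True of groupby.

```python
import pandas as pd

# Small made-up frame shaped like the question's example (values are illustrative).
df = pd.DataFrame({
    "col1": ["apple", "apple", "apple", "pear", "pear", "lemon"],
    "col2": ["red", "red", "pink", "green", "green", "yellow"],
    "col3": [2.99, 1.99, 1.99, 0.99, 1.29, 0.99],
})

g = df.groupby(["col1", "col2"])

# ngroup() returns one integer per row: the id of the group that row belongs to.
# With sorted keys: (apple, pink)=0, (apple, red)=1, (lemon, yellow)=2, (pear, green)=3.
group_ids = g.ngroup()
print(df.assign(group_id=group_ids))

# Selecting rows by group id keeps every row of the chosen groups.
subset = df[group_ids.isin([0, 3])]
print(subset)
```

Once each row carries its group id, picking random groups reduces to picking random integers below g.ngroups.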

Sampling groups can be done more concisely with `numpy.random.choice` (no need to shuffle the full list), i.e. `df[g.ngroup().isin(np.random.choice(g.ngroups, 2, replace=False))]`. (3 upvotes)
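The suggestion in the comment above can be wrapped in a small helper. This is only a sketch: the function name sample_groups and its seed parameter are hypothetical, it uses NumPy's Generator.choice rather than the legacy np.random.choice, and the sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def sample_groups(df, keys, n, seed=None):
    """Return all rows of n groups drawn at random without replacement.

    Hypothetical helper; raises ValueError if n exceeds the number of groups.
    """
    g = df.groupby(keys)
    rng = np.random.default_rng(seed)
    # Draw n distinct group ids from 0 .. ngroups-1.
    chosen = rng.choice(g.ngroups, size=n, replace=False)
    return df[g.ngroup().isin(chosen)]

# Made-up data with four groups (illustrative values only).
df = pd.DataFrame({
    "col1": ["apple", "apple", "apple", "pear", "pear", "lemon", "lemon"],
    "col2": ["red", "red", "pink", "green", "green", "yellow", "yellow"],
    "col3": [2.99, 1.99, 1.99, 0.99, 1.29, 0.99, 1.99],
})

sample = sample_groups(df, ["col1", "col2"], n=3, seed=0)
print(sample)
```

Passing a seed makes the draw reproducible; leaving it as None gives a different set of groups on each call.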
