Pandas: assign an index to each group identified by groupby

Question

When using groupby(), how can I create a DataFrame with a new column containing the group number of each row, similar to dplyr::group_indices in R? For example, if I have

>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2

how can I get a DataFrame that also has an idx column holding each row's group number?

(The order of the idx values does not matter.)

Answer 1

fog*_*rit 15

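A simple way is to concatenate the grouping columns (so that each combination of their values becomes one distinct element), then convert the result to a pandas Categorical and keep only its integer codes: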

df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df

    a   b   idx
0   1   1   0
1   1   1   0
2   1   2   1
3   2   1   2
4   2   1   2
5   2   2   3

Edit: changed the labels attribute to codes, since the former appears to be deprecated.

Edit2: added a separator, as suggested by Authman Apatira.

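Be careful when merging columns like this. Without a separator, a = 11, b = 1 produces the same group code as a = 1, b = 11, even though they are distinct groups. If you go this route, be sure to add some kind of separator between the columns. I would like to see this approach benchmarked against a proper groupby, though, for both memory and processor requirements. (2 upvotes)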
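
The collision described in this comment can be reproduced with a small sketch. The data below is hypothetical and chosen only so that naive concatenation collides; the variable names `no_sep` and `with_sep` are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical data: (11, 1) and (1, 11) are different groups.
df = pd.DataFrame({'a': [11, 1, 1], 'b': [1, 11, 11]})

# Without a separator, both combinations become the string '111',
# so every row ends up with the same code.
no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes

# With a separator the keys are '11_1' and '1_11', so the groups stay distinct.
with_sep = pd.Categorical(
    df['a'].astype(str) + '_' + df['b'].astype(str)
).codes

print(no_sep.tolist())
print(with_sep.tolist())
```

Note that a separator only helps if it cannot occur inside the values themselves; for string columns that may already contain `_`, the same kind of collision is still possible.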

Answer 2

Joh*_*hnE 15

Here is a concise way to get unique identifiers using drop_duplicates and merge.

group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )

   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5

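In this case the identifiers are 0, 2, 3, 5 (just leftovers from the original index), but they can easily be changed to 0, 1, 2, 3 by adding a reset_index(drop=True) before the reset_index() call.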
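
As a minimal sketch of the consecutive-identifier variant, assuming the same example DataFrame as in the question: the intermediate name `keys`, the rename to `idx`, and the `how='left'` argument are additions for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# One row per distinct (a, b) combination. reset_index(drop=True) discards
# the leftover original labels (0, 2, 3, 5) and relabels the rows 0..3.
keys = df.drop_duplicates(group_vars).reset_index(drop=True)

# The second reset_index() turns those fresh labels into a column,
# which is renamed here to 'idx' (an illustrative choice).
keys = keys.reset_index().rename(columns={'index': 'idx'})

# how='left' keeps the rows of df in their original order.
result = df.merge(keys, on=group_vars, how='left')
print(result)
```

With this approach the identifiers follow the order in which each combination first appears in the data.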

Answer 3

Cal*_*You 14

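Here is a solution using ngroup, taken from a comment above by Constantino, for those still looking for this function (the equivalent of dplyr::group_indices in R, in case you were googling those keywords like I was). By my own timings, it is also about 25% faster than the solution given by maxliving (timed below as create_index_usingduplicated, whose definition is not reproduced on this page).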

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df['idx'] = df.groupby(['a', 'b']).ngroup()
>>> df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3

>>> %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
1.83 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['idx'] = df.groupby(['a', 'b']).ngroup()
1.38 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

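The question says the order of the idx values does not matter, but if it does in your case, the following sketch shows how the `sort` argument of groupby affects the numbering that ngroup produces. The data is hypothetical and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data in which the first group seen is not the smallest key.
df = pd.DataFrame({'a': [2, 2, 1, 1], 'b': [1, 1, 1, 2]})

# Default (sort=True): groups are numbered in sorted key order,
# so (1, 1) gets 0, (1, 2) gets 1 and (2, 1) gets 2.
by_sorted_key = df.groupby(['a', 'b']).ngroup()

# sort=False: groups are numbered in order of first appearance,
# so (2, 1) gets 0, (1, 1) gets 1 and (1, 2) gets 2.
by_first_seen = df.groupby(['a', 'b'], sort=False).ngroup()

print(by_sorted_key.tolist())
print(by_first_seen.tolist())
```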

Asked:	9 years, 1 month ago
Viewed:	9530 times
Last active:	6 years, 5 months ago