Pandas DataFrame: get the first row of each group

Nil*_*age 110 python dataframe pandas

I have a pandas DataFrame like the following.

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group this by ["id","value"] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome

    id   value
     1   first
     2   first
     3   first
     4  second
     5   first
     6   first
     7  fourth

I tried the following, which only gives the first row of the DataFrame. Any help with this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

Rom*_*kar 195

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
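The three calls in this answer can be collected into one runnable sketch. It assumes the question's DataFrame; the variable names `first_rows`, `first_rows_flat` and `first_two` are illustrative, and `as_index=False` is shown as one possible alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# First non-null value per column in each group; 'id' becomes the index.
first_rows = df.groupby('id').first()

# Same values, but 'id' stays a regular column (alternative to reset_index()).
first_rows_flat = df.groupby('id', as_index=False).first()

# First two rows of each group, with a fresh 0..n-1 index.
first_two = df.groupby('id').head(2).reset_index(drop=True)

print(first_rows_flat)
print(first_two)
```

Note that `head(n)` works like a filter on the original rows, whereas `first()` aggregates each group into a single row.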

  • If you want the last n rows, use `tail(n)` (n defaults to 5) ([ref.](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html)). Not to be confused with `last()`; I made that mistake. (3 upvotes)
  • Thanks a lot! Works great :) It isn't possible to get the second row in the same way, is it? Could you explain that too? (2 upvotes)
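As a rough illustration of the `tail(n)` versus `last()` distinction raised in the comment above, applied per group: this sketch assumes the question's DataFrame, and the names `last_two` and `last_values` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# tail(2): the last two rows of every group, original index preserved.
last_two = df.groupby('id').tail(2)

# last(): one row per group, holding the last non-null value of each column.
last_values = df.groupby('id').last()

print(last_two)
print(last_values)
```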

小智 46

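This will give you the second row of each group (zero-indexed; nth(0) picks the same values as first() as long as the data has no missing values):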

df.groupby('id').nth(1) 

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group

  • If you want several rows, e.g. the first three, pass a sequence such as `nth((0,1,2))` or `nth(range(3))`. (7 upvotes)
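A small sketch of `nth` with a single position and with a list of positions, assuming the question's DataFrame. The shape of the result (whether `id` ends up as the index or the original index is kept) has differed between pandas versions, so the example only reads the `value` column; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Second row of each group (positions are zero-based).
# Group 5 has only one row, so it contributes nothing.
second_values = df.groupby('id').nth(1)['value'].tolist()

# Up to the first three rows of each group, selected by position.
first_three = df.groupby('id').nth([0, 1, 2])

print(second_values)
print(len(first_three))
```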

vit*_*dml 24

I suggest using .nth(0) rather than .first() if you need to get the first row.

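The difference between them is how they handle NaN: .nth(0) returns the first row of the group no matter what values that row holds, while .first() returns the first non-NaN value in each column.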

For example, if your dataset is:

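import numpy as np
import pandas as pd

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.nan,
                        "second","first","second","third",
                        "fourth","first","second"]})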

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first
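The NaN behaviour can be checked with a short, self-contained sketch. It rebuilds the dataset from this answer and compares only the `value` column, since the index layout returned by `nth` depends on the pandas version. `head(1)` is included as an assumption-free positional alternative; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    'value': ["first", "second", "third", np.nan, "second", "first",
              "second", "third", "fourth", "first", "second"],
})

# first(): skips NaN, so group 2 yields "second".
first_non_null = df.groupby('id').first()['value'].tolist()

# nth(0): purely positional, so group 2 yields NaN.
first_positional = df.groupby('id').nth(0)['value'].tolist()

# head(1): also positional, and always keeps the original index.
first_by_head = df.groupby('id').head(1)['value'].tolist()

print(first_non_null)
print(first_positional)
print(first_by_head)
```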

  • Another difference is that nth(0) keeps the original index (if as_index=False), whereas first() does not. For me that is a big difference, because I needed the index itself. (3 upvotes)

WeN*_*Ben 9

If you only need the first row of each group, we can use drop_duplicates; note that the function's default is keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth
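A minimal sketch of the `drop_duplicates` approach, assuming the question's DataFrame. It shows the default `keep='first'` explicitly alongside `keep='last'` for comparison; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Keep the first occurrence of every id (keep='first' is the default).
first_rows = df.drop_duplicates('id', keep='first')

# Keep the last occurrence of every id instead.
last_rows = df.drop_duplicates('id', keep='last')

print(first_rows)
print(last_rows)
```

Because this filters rows rather than aggregating, the original index labels survive, and NaN values in the first row are returned as they are.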


Sir*_* S. 6

Maybe this is what you want

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
>>> df
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
Sort each state's rows by pop in descending order, then keep the top three per state:

df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55
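One possible way to get the same top-three-per-state result without `apply` is to sort once and then take `head(3)` per group. This is a sketch, not the answer author's method: the level names `state` and `county` are assumptions added so that the index level can be referenced by name in `sort_values`.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['state1', 'state2'], ['county1', 'county2', 'county3', 'county4']],
    names=['state', 'county'],  # hypothetical level names, not in the original
)
df = pd.DataFrame({'pop': [12, 15, 65, 42, 78, 67, 55, 31]}, index=idx)

# Sort by state (ascending) and population (descending) in a single pass;
# sort_values accepts index level names alongside column labels.
ordered = df.sort_values(['state', 'pop'], ascending=[True, False])

# head(3) keeps the first three rows of each state in the sorted order.
top3 = ordered.groupby(level='state').head(3)

print(top3)
```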