Zhu*_*arb 4 python group-by pandas
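I admit I'm no Python guru, but I still find working with Pandas DataFrameGroupBy and SeriesGroupBy objects exceptionally counter-intuitive. (I come from an R background.)
I have the following DataFrame: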
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
                   'code' : ['one', 'one', 'two', 'three',
                             'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white','white','white',
                              'black', 'black', 'white', 'white'],
                   'irrelevant1': ['foo', 'foo', 'foo','bar','bar',
                                   'foo','bar','bar'],
                   'irrelevant2': ['foo', 'foo', 'foo','bar','bar',
                                   'foo','bar','bar'],
                   'irrelevant3': ['foo', 'foo', 'foo','bar','bar',
                                   'foo','bar','bar'],
                   'amount' : np.random.randn(8)},
                  columns=['id','code','colour', 'irrelevant1',
                           'irrelevant2', 'irrelevant3', 'amount'])
I want to be able to group the ids by code and colour. The code below does the grouping, but it keeps all the columns.
gb = df.groupby(['code','colour'])
gb.head(5)
                id   code colour irrelevant1 irrelevant2 irrelevant3    amount
code  colour
one   black  0   1    one  black         foo         foo         foo -0.644170
      white  1   2    one  white         foo         foo         foo  0.912372
             6   7    one  white         bar         bar         bar  0.530575
three black  5   6  three  black         foo         foo         foo -0.123806
      white  3   4  three  white         bar         bar         bar -0.387080
two   black  4   5    two  black         bar         bar         bar -0.578107
      white  2   3    two  white         foo         foo         foo  0.768637
             7   8    two  white         bar         bar         bar -0.282577
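Questions:
1) In gb, how do I keep only the id column (without even any index) and get rid of the rest?
2) Once I have the desired DataFrameGroupBy gb, how do I access the ids for the case {code = one and colour = white}? I tried gb.get_group('one','white') and gb.get_group(['one','white']), but neither works.
3) How do I access the entries where {colour = white}, i.e. without specifying the code index?
4) Finally, the manual is not very helpful. Do you know of any sources with examples of creating and accessing these grouped objects?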
For your questions you don't even need to do a groupby (though you should read more about it in the prose docs).
A better solution is a MultiIndex:
In [36]: df = df.set_index(['code', 'colour']).sort_index()
In [37]: df
Out[37]:
             id irrelevant1 irrelevant2 irrelevant3    amount
code  colour
one   black   1         foo         foo         foo  0.103045
      white   2         foo         foo         foo  0.751824
      white   7         bar         bar         bar -1.275114
three black   6         foo         foo         foo  0.311305
      white   4         bar         bar         bar -0.416722
two   black   5         bar         bar         bar  1.534859
      white   3         foo         foo         foo -1.068399
      white   8         bar         bar         bar -0.243893

[8 rows x 5 columns]
That takes care of 1.
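To keep only the id column, as question 1 asks, one option is to select it from the indexed frame. The following is a minimal sketch, assuming a trimmed copy of the question's frame (the irrelevant columns are left out); the names `indexed`, `ids` and `bare_ids` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's frame (assumption: irrelevant* columns omitted).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

indexed = df.set_index(['code', 'colour']).sort_index()

# Keep only the id column; the result is a Series that still carries
# the (code, colour) MultiIndex.
ids = indexed['id']

# Drop the index as well by converting to a plain NumPy array.
bare_ids = ids.to_numpy()
```

Whether the index should be dropped depends on what happens next: the later lookups by code and colour rely on it, so `ids` is probably the more useful of the two.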
2: Use the familiar slicing syntax:
In [38]: df.loc['one', 'white']
Out[38]:
             id irrelevant1 irrelevant2 irrelevant3    amount
code  colour
one   white   2         foo         foo         foo  0.751824
      white   7         bar         bar         bar -1.275114

[2 rows x 5 columns]
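The same lookup can be narrowed to just the ids. This is a small sketch under the same assumption of a trimmed copy of the question's frame; passing the two labels as one explicit tuple makes it clear that both refer to index levels rather than to a row and a column. The name `one_white_ids` is illustrative.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's frame (assumption: irrelevant* columns omitted).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

indexed = df.set_index(['code', 'colour']).sort_index()

# Row key is the tuple (code, colour); the second argument picks the column.
one_white_ids = indexed.loc[('one', 'white'), 'id']

print(one_white_ids.tolist())
```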
3: This is a cross-section; use .xs:
In [39]: df.xs('white', level='colour')
Out[39]:
      id irrelevant1 irrelevant2 irrelevant3    amount
code
one    2         foo         foo         foo  0.751824
one    7         bar         bar         bar -1.275114
three  4         bar         bar         bar -0.416722
two    3         foo         foo         foo -1.068399
two    8         bar         bar         bar -0.243893

[5 rows x 5 columns]
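As a runnable sketch of the cross-section, again assuming a trimmed copy of the question's frame: `.xs` drops the level it selects on, while a `pd.IndexSlice` lookup is one alternative that keeps both levels. The IndexSlice variant is an addition here, not something the original answer used, and it relies on the index having been sorted.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's frame (assumption: irrelevant* columns omitted).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

indexed = df.set_index(['code', 'colour']).sort_index()

# Cross-section: every row whose colour level is 'white'.
# The colour level is dropped from the result's index.
white = indexed.xs('white', level='colour')

# Alternative that keeps both index levels (needs the sorted index above).
white_full = indexed.loc[pd.IndexSlice[:, 'white'], :]
```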
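4: There are examples everywhere. Check the pandas/groupby tags on this site, the section of the docs that is being worked on right now, and the prose docs linked above.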
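If a GroupBy object is still wanted, the following sketch shows one way to get at the ids, assuming a trimmed copy of the question's frame. `get_group` takes a single key, so for a two-column grouping the key is one tuple; that is probably why the two calls in the question fail. The names `white_one` and `ids_per_group` are illustrative.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's frame (assumption: irrelevant* columns omitted).
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

# Selecting ['id'] on the grouped frame gives a SeriesGroupBy over id only.
gb = df.groupby(['code', 'colour'])['id']

# One tuple as the key, with one label per grouping column.
white_one = gb.get_group(('one', 'white'))

# Collect the ids of every group into a Series of lists,
# indexed by (code, colour).
ids_per_group = gb.apply(list)
```

The MultiIndex approach above is usually simpler for plain lookups; the grouped form is more natural when each group is going to be aggregated or transformed.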