Kee*_*eth 13 python sorting multi-index pandas
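I have a multi-index DataFrame created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.
The initial dataset looks something like this (daily sales counts of various products):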
Date Manufacturer Product Name Product Launch Date Sales
0 2013-01-01 Apple iPod 2001-10-23 12
1 2013-01-01 Apple iPad 2010-04-03 13
2 2013-01-01 Samsung Galaxy 2009-04-27 14
3 2013-01-01 Samsung Galaxy Tab 2010-09-02 15
4 2013-01-02 Apple iPod 2001-10-23 22
5 2013-01-02 Apple iPad 2010-04-03 17
6 2013-01-02 Samsung Galaxy 2009-04-27 10
7 2013-01-02 Samsung Galaxy Tab 2010-09-02 7
I use groupby to get a sum over the date range:
> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
Sales
Manufacturer Product Name Product Launch Date
Apple iPad 2010-04-03 30
iPod 2001-10-23 34
Samsung Galaxy 2009-04-27 24
Galaxy Tab 2010-09-02 22
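So far so good!
Now the last thing I want to do is sort each manufacturer's products by launch date, while keeping them hierarchically grouped under manufacturer. Here's all I'm trying to do: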
Sales
Manufacturer Product Name Product Launch Date
Apple iPod 2001-10-23 34
iPad 2010-04-03 30
Samsung Galaxy 2009-04-27 24
Galaxy Tab 2010-09-02 22
When I try sortlevel() I lose the per-company hierarchy I had before:
> grouped.sortlevel('Product Launch Date')
Sales
Manufacturer Product Name Product Launch Date
Apple iPod 2001-10-23 34
Samsung Galaxy 2009-04-27 24
Apple iPad 2010-04-03 30
Samsung Galaxy Tab 2010-09-02 22
sort() and sort_index() just fail:
grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'
grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'
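It seems like a simple operation, but I can't quite figure it out.
I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.
BTW, the code to produce the initial DataFrame is: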
from pandas import DataFrame

data = {
    'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung'],
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'],
    'Product Launch Date': ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02', '2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7]
}
df = DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])
One hack would be to change the order of the levels:
In [11]: g
Out[11]:
Sales
Manufacturer Product Name Product Launch Date
Apple iPad 2010-04-03 30
iPod 2001-10-23 34
Samsung Galaxy 2009-04-27 24
Galaxy Tab 2010-09-02 22
In [12]: g.index = g.index.swaplevel(1, 2)
sortlevel (as you've found) sorts the MultiIndex levels in order:
In [13]: g = g.sortlevel()
And swap back:
In [14]: g.index = g.index.swaplevel(1, 2)
In [15]: g
Out[15]:
Sales
Manufacturer Product Name Product Launch Date
Apple iPod 2001-10-23 34
iPad 2010-04-03 30
Samsung Galaxy 2009-04-27 24
Galaxy Tab 2010-09-02 22
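I'm of the opinion that sortlevel should not sort the remaining labels in order, so I will create a GitHub issue. :) It's still worth mentioning the docs note about "the need for sortedness".
Note: you can avoid the first swaplevel by reordering the initial groupby: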
g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()
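For anyone on a recent pandas: as far as I know, DataFrame.sortlevel was deprecated in 0.20 and removed in 1.0, so the swap trick above has to go through sort_index() instead. A minimal sketch, assuming a modern pandas; the four-row frame is a pre-summed stand-in of my own, not the original daily data:

```python
import pandas as pd

# Pre-summed stand-in for the question's data (assumption: one row per product).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"],
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"],
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"],
    "Sales": [34, 30, 24, 22],
})
g = df.groupby(["Manufacturer", "Product Name", "Product Launch Date"]).sum()

# Move the launch date into position 1 so it becomes the second sort key,
# sort the whole index lexicographically, then swap the levels back.
g = g.swaplevel(1, 2).sort_index().swaplevel(1, 2)
print(g)
```

The level order after the second swaplevel is the original one, only the row order has changed.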
This one-liner works for me:
In [1]: grouped.sortlevel(["Manufacturer","Product Launch Date"], sort_remaining=False)
Sales
Manufacturer Product Name Product Launch Date
Apple iPod 2001-10-23 34
iPad 2010-04-03 30
Samsung Galaxy 2009-04-27 24
Galaxy Tab 2010-09-02 22
Note that this works too:
grouped.sortlevel([0,2], sort_remaining=False)
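This wouldn't have worked when you originally posted over two years ago, because sortlevel by default sorted on all index levels, which broke your company hierarchy. sort_remaining, which disables that behavior, was added last year. Here's the commit link for reference: https://github.com/pydata/pandas/commit/3ad64b11e8e4bef47e3767f1d31cc26e39593277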
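On current pandas, where DataFrame.sortlevel is no longer available, sort_index() accepts the same sort_remaining flag. A rough equivalent of the one-liner, assuming pandas 1.0 or later; the small frame is again a pre-summed stand-in rather than the original data:

```python
import pandas as pd

# Pre-summed stand-in for the question's data (assumption).
df = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"],
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"],
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"],
    "Sales": [34, 30, 24, 22],
})
grouped = df.groupby(["Manufacturer", "Product Name", "Product Launch Date"]).sum()

# Sort on levels 0 and 2 only; level 1 (Product Name) is not used as a sort key.
result = grouped.sort_index(level=[0, 2], sort_remaining=False)
print(result)
```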
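To sort a MultiIndex by its "index columns" (aka levels), you need to use the .sort_index() method and set its level argument. If you want to sort by multiple levels, set the argument to a list of level names in sequential order.
This should give you the DataFrame you need: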
df.groupby(['Manufacturer',
            'Product Name',
            'Product Launch Date']
          ).sum().sort_index(level=['Manufacturer', 'Product Launch Date'])
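To make the role of the level list concrete, here is a self-contained sketch on the question's data. It assumes a recent pandas and selects only the Sales column before summing (my addition, so the string Date column is not aggregated). The order of the names in level sets the sort priority:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2013-01-01"] * 4 + ["2013-01-02"] * 4,
    "Manufacturer": ["Apple", "Apple", "Samsung", "Samsung"] * 2,
    "Product Name": ["iPod", "iPad", "Galaxy", "Galaxy Tab"] * 2,
    "Product Launch Date": ["2001-10-23", "2010-04-03", "2009-04-27", "2010-09-02"] * 2,
    "Sales": [12, 13, 14, 15, 22, 17, 10, 7],
})
keys = ["Manufacturer", "Product Name", "Product Launch Date"]
grouped = df.groupby(keys)[["Sales"]].sum()

# Manufacturer first, launch date second: the per-manufacturer grouping is kept.
by_maker = grouped.sort_index(level=["Manufacturer", "Product Launch Date"])

# Launch date alone: manufacturers get interleaved, as in the question.
by_date = grouped.sort_index(level="Product Launch Date")

print(by_maker)
print(by_date)
```

The launch dates here are ISO-formatted strings, so their text order matches chronological order; with another date format it would be safer to convert the column with pd.to_datetime before grouping.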