I ran into the following problem with the ordering of row and column headers.
Here is how to reproduce it:
import numpy as np
import pandas as pd
from IPython.display import HTML  # used below to render the tables

X = pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A = pd.qcut(X['x'], [0, 0.25, 0.5, 0.75, 1.0])  # create a factor
B = pd.qcut(X['y'], [0, 0.25, 0.5, 0.75, 1.0])  # create another factor
g = X.groupby([A, B])['x'].mean()  # do a two-way bucketing
print(g)
#this gives the following and so far so good
x                 y
[-2.315, -0.843]  [-2.58, -0.567]   -1.041167
                  (-0.567, 0.0321]  -1.722926
                  (0.0321, 0.724]   -1.245856
                  (0.724, 3.478]    -1.240876
(-0.843, -0.228]  [-2.58, -0.567]   -0.576264
                  (-0.567, 0.0321]  -0.501709
                  (0.0321, 0.724]   -0.522697
                  (0.724, 3.478]    -0.506259
(-0.228, 0.382]   [-2.58, -0.567]    0.175768
                  (-0.567, 0.0321]   0.214353
                  (0.0321, 0.724]    0.113650
                  (0.724, 3.478]    -0.013758
(0.382, 2.662]    [-2.58, -0.567]    0.983807
                  (-0.567, 0.0321]   1.214640
                  (0.0321, 0.724]    0.808608
                  (0.724, 3.478]     1.515334
Name: x, dtype: float64
#Now let's make a two way table and here is the problem:
HTML(g.unstack().to_html())
This gives:
y                 (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  [-2.58, -0.567]
x
(-0.228, 0.382]           0.214353         0.113650       -0.013758         0.175768
(-0.843, -0.228]         -0.501709        -0.522697       -0.506259        -0.576264
(0.382, 2.662]            1.214640         0.808608        1.515334         0.983807
[-2.315, -0.843]         -1.722926        -1.245856       -1.240876        -1.041167
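Note that the headers are no longer in order. I would like to know a good way to fix this so that interactive work stays easy.
To track down where the problem comes from, run the following: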
g.unstack().columns
It gives me: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)
Now compare that with B.levels:
B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)
Clearly, the order that was originally in the factor has been lost.
To make matters worse, let's build a multi-level cross table:
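A likely explanation (my inference from the dtype=object output, not something confirmed in this thread): the bucket labels are plain strings, so any sorting of them is lexicographic, and '(' sorts before '['. A minimal sketch using only the standard library and the column labels shown above:

```python
# Column labels in the factor's original (numeric) order, copied from B.levels above.
col_labels = ['[-2.58, -0.567]', '(-0.567, 0.0321]', '(0.0321, 0.724]', '(0.724, 3.478]']

# Sorting the labels as strings compares them character by character,
# so every label starting with '(' comes before the one starting with '['.
lexicographic = sorted(col_labels)

print(lexicographic)
```

The sorted list matches the column order in the unstacked table, which is consistent with the labels being sorted as strings somewhere along the way.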
g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())
It displays as follows:
y                    (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]
x
(-0.228, 0.382]   x          0.214353         0.113650       -0.013758
                  y         -0.293465         0.321836        1.180369
(-0.843, -0.228]  x         -0.501709        -0.522697       -0.506259
                  y         -0.204811         0.324571        1.167005
(0.382, 2.662]    x          1.214640         0.808608        1.515334
                  y         -0.195446         0.161198        1.074532
[-2.315, -0.843]  x         -1.722926        -1.245856       -1.240876
                  y         -0.392896         0.335471        1.730513
Both the row and the column labels are in the wrong order.
Thanks.
This feels like a bit of a hack, but here goes:
In [11]: g_unstacked = g.unstack()
In [12]: g_unstacked
Out[12]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
x
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
This relies on the fact that unique preserves order* (take the unique first-level items from g's index):
In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
'(0.625, 1.639]'], dtype=object)
As you can see, these are in the correct order.
Now you can reindex like this:
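To illustrate the order-preserving behaviour this trick depends on, here is a small sketch comparing pandas' unique with np.unique. The label list is a made-up sample built from the row labels above, with one repeat added for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row labels from above, first label repeated.
labels = ['[-3.124, -0.892]', '(-0.892, -0.068]', '[-3.124, -0.892]', '(-0.068, 0.625]']

# pandas returns unique values in order of first appearance.
order_of_appearance = list(pd.Index(labels).unique())

# np.unique returns the unique values sorted, here lexicographically.
sorted_unique = [str(v) for v in np.unique(np.array(labels))]

print(order_of_appearance)
print(sorted_unique)
```

Only the first result keeps the bucket order, which is why the reindex has to be built from the pandas call and not from np.unique.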
In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
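Now the order is correct.
Update (I missed that the columns were out of order too).
You can use the same trick for the columns (you have to chain the operations):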
In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)
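For what it's worth, the row fix and the column fix can be combined into one call. The sketch below is only an illustration under a few assumptions: the labels are copied from the output above, the values are made-up placeholders (0 to 15) and not the real means, and it uses reindex(index=..., columns=...) instead of reindex_axis, which as far as I know is no longer available in newer pandas releases.

```python
import pandas as pd

# Bucket labels in their intended (numeric) order, copied from the output above.
row_labels = ['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]', '(0.625, 1.639]']
col_labels = ['[-2.177, -0.565]', '(-0.565, 0.12]', '(0.12, 0.791]', '(0.791, 2.57]']

# Placeholder two-level Series standing in for g (values are NOT the real means).
index = pd.MultiIndex.from_product([row_labels, col_labels], names=['x', 'y'])
g = pd.Series(range(16), index=index, dtype=float)

g_unstacked = g.unstack()

# unique() keeps order of first appearance, so these recover the original order.
row_order = g.index.get_level_values(0).unique()
col_order = g.index.get_level_values(1).unique()

# Reorder rows and columns together.
fixed = g_unstacked.reindex(index=row_order, columns=col_order)
print(fixed)
```

Whether g_unstacked itself comes out sorted depends on the pandas version, but the reindex should restore the order taken from g's index either way.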
*This is also why Series.unique is noticeably faster than np.unique, which sorts its result.