Groupby on percentiles of the values of a selected DataFrame column

pms*_*pms 6 python group-by pandas

Imagine I have a DataFrame whose columns contain only real values.

>> df        
          col1   col2      col3  
0     0.907609     82  4.207991 
1     3.743659   1523  6.488842 
2     2.358696    324  5.092592  
3     0.006793      0  0.000000  
4    19.319746  11969  7.405685 

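I want to group it by the quartiles (or any other percentiles I specify) of a selected column (for example, col1), so that I can perform some operations on those groups. Ideally, I would like to do something like this: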

df.groupby( quartiles_of_col1 ).mean()  # not working, how to code quartiles_of_col1?

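The output should give the mean of each column for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What is the simplest way to achieve it?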

CT *_*Zhu 10

I don't have a computer to test this on right now, but I think you can do it with: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update in 150 minutes.

Some explanation:

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
#Need to use include_lowest=True
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Or the smallest values will be skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
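The session above uses the percentiles [0, 25, 75, 90, 100], which are not quartile edges, and its columns are named col0 to col2 rather than col1 to col3 as in the question. Below is a minimal sketch of the quartile case the question asks about, assuming a recent pandas and the sample data from the question. The names `edges`, `bins`, `quartile`, `by_cut` and `by_qcut` are illustrative only. `pd.qcut` is shown as an alternative that the answer does not mention; it computes the quantile edges itself.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Quartile edges: the 0th, 25th, 50th, 75th and 100th percentiles of col1.
edges = np.percentile(df["col1"], [0, 25, 50, 75, 100])

# include_lowest=True keeps the minimum value inside the first bin.
bins = pd.cut(df["col1"], edges, include_lowest=True)
by_cut = df.groupby(bins, observed=False).mean()

# Alternative: pd.qcut derives the edges from the data; q=4 means quartiles.
quartile = pd.qcut(df["col1"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
by_qcut = df.groupby(quartile, observed=False).mean()

print(by_qcut)
```

With only five rows the first quartile group holds two rows and the others one each. Passing `observed=False` keeps empty bins in the result, which matches the NaN row shown in the answer's output.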

  • This works for me! Brilliant use of cut! Thanks, champ (2 upvotes)