Passing percentiles to the pandas agg function

Question

I want to pass the numpy percentile() function through pandas' agg() function, as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

My code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The code above works, but I would like to do something like this:

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

That is, I want to specify the various percentiles to be returned from agg().

How can this be done?

Answer 1

And*_*den 75

Perhaps not super efficient, but one way is to create a function yourself:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22              9           20.7

Not sure this is how it should be done, though...
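
As a quick check, here is a minimal runnable sketch of the approach above, assuming the sample data from the question is rebuilt by hand as `dataframe`. String aliases such as "sum" are used in place of the numpy functions; that substitution is a choice made here, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it was created)
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

def percentile(n):
    """Return an aggregator for the n-th percentile, with n on a 0-100 scale."""
    def percentile_(x):
        return np.percentile(x, n)
    # agg uses __name__ as the column label
    percentile_.__name__ = "percentile_%s" % n
    return percentile_

column = dataframe.groupby("AGGREGATE")["MY_COLUMN"]
result = column.agg(["sum", "mean", percentile(50), percentile(95)])
print(result)
```

With numpy's default linear interpolation, the 95th percentile comes out to 76.8 for group A and 20.7 for group B.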

Answer 2

prl*_*900 13

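More specifically, if you just want to aggregate your pandas groupby results using the percentile function, a Python lambda function offers a pretty neat solution. Using the question's notation, aggregating by the 95th percentile should be: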

# agg passes each group's values as a Series, so select the column before aggregating
dataframe.groupby('AGGREGATE')['MY_COLUMN'].agg(lambda x: np.percentile(x, q=95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.
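
A minimal sketch of that idea, assuming the question's sample data and pandas 0.25 or later for keyword-style (named) aggregation. The names `p95`, `total`, and `average` are illustrative only and do not come from the answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it was created)
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

# Hypothetical variable holding the reusable percentile aggregator
p95 = lambda x: np.percentile(x, q=95)

# Each keyword becomes an output column, which avoids the generic "<lambda>" label
summary = dataframe.groupby("AGGREGATE")["MY_COLUMN"].agg(
    total="sum",
    average="mean",
    p95=p95,
)
print(summary)
```

Naming the outputs explicitly matters mostly when several lambdas are combined, since they would otherwise share the same label.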

Answer 3

Tho*_*mas 13

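I really like the solution Andy Hayden gave; however, it had several problems for me:

If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
For me, the row names were percentile_0.5 (a dot instead of an underscore). Not sure what caused this, possibly that I am using Python 3.
It also requires importing numpy instead of staying within pandas (I know, numpy is implicitly imported by pandas...).

Here is an updated version that fixes these issues: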

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
    return percentile_

I think the format "{:02.0f}" would be better, to avoid the space for single-digit percentage values. (4 upvotes)
Did you mean `return x.quantile(n)` in your version? (2 upvotes)
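
The naming suggestion from the comments can be sketched as follows, assuming the question's sample data. Here `n` is a fraction between 0 and 1, since it is passed straight to `Series.quantile`.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it was created)
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

def percentile(n):
    """Return an aggregator for quantile n, with n as a fraction in [0, 1]."""
    def percentile_(x):
        return x.quantile(n)
    # Zero-padded label: 0.05 gives "percentile_05" rather than "percentile_ 5"
    percentile_.__name__ = "percentile_{:02.0f}".format(n * 100)
    return percentile_

result = dataframe.groupby("AGGREGATE")["MY_COLUMN"].agg(
    [percentile(0.05), percentile(0.5)]
)
print(result)
```

The zero padding only changes the column label; the computed values are the same as with the "{:2.0f}" format.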

Answer 4

小智 12

df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

By default, the describe function gives us mean, count, std, min, and max; with the percentiles array you can choose the percentiles you need.
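
When describe is called on the whole grouped DataFrame rather than on a single column, the result has two-level columns of (column name, statistic). A short sketch, assuming the question's sample data, of pulling out just the percentile columns:

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it was created)
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

desc = dataframe.groupby("AGGREGATE").describe(percentiles=[0.25, 0.5, 0.95])

# Percentile columns are labelled as strings such as "50%" and "95%"
picked = desc["MY_COLUMN"][["50%", "95%"]]
print(picked)
```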

Answer 5

Ant*_*iez 12

A more efficient solution uses the pandas.Series.quantile method:

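# Named aggregation: the keyword (q50 here, an arbitrary label) becomes the output column name
df.groupby("AGGREGATE").agg(q50=("YOUR_COL_NAME", lambda x: x.quantile(0.5)))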

With several percentile values:

percentiles = [0.5, 0.9, 0.99]
# p=p binds the current value; a bare closure would use the last p for every lambda
quantile_funcs = [(p, lambda x, p=p: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
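
One pitfall when building lambdas in a loop or comprehension is Python's late binding of closure variables. The sketch below, which uses a small hand-made Series as an assumption, contrasts a bare closure with one that binds `p` as a default argument:

```python
import pandas as pd

percentiles = [0.5, 0.9, 0.99]

# Late binding: each lambda looks up p only when called, after the comprehension has finished
late = [lambda x: x.quantile(p) for p in percentiles]

# Default argument: the current value of p is stored with each lambda
bound = [lambda x, p=p: x.quantile(p) for p in percentiles]

s = pd.Series([10, 12, 84])
late_results = [f(s) for f in late]    # every entry uses p = 0.99
bound_results = [f(s) for f in bound]  # one entry per requested quantile
print(late_results)
print(bound_results)
```

All entries of `late_results` are identical, while `bound_results` holds three distinct quantiles.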

Answer 6

sco*_*tle 11

Try this for the 50th and 95th percentiles:

column.describe( percentiles = [ 0.5, 0.95 ] )

Answer 7

jva*_*ans 10

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
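
Passing a list to quantile on a groupby produces one row per (group, quantile) pair. A minimal sketch, assuming the question's sample data, that pivots the quantile level into columns with unstack:

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it was created)
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

# Long form: a Series indexed by (AGGREGATE, quantile)
long_form = dataframe.groupby("AGGREGATE")["MY_COLUMN"].quantile([0.5, 0.95])

# Wide form: one row per group, one column per quantile
wide = long_form.unstack()
print(wide)
```

The column labels of `wide` are the float quantiles themselves (0.5 and 0.95), not strings.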

Answer 8

Mak*_*sim 7

For situations where all you need is a subset of the describe output (typically the most commonly needed statistics), you can just index the returned result without needing any extra functions.

For example, I commonly find myself just needing to present the 25th percentile, median, 75th percentile, and count. This can be done in just one line like so:

column.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use cases there is no need for extra functions.

Answer 9

mag*_*raf 6

Just to throw a more general solution into the ring. Suppose you have a DF with just one column to group:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and compute basically any descriptive metric with a list that includes anonymous (lambda) functions, for example:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

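However, if you want to aggregate over multiple columns, you have to use a non-anonymous function or address the columns explicitly: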

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})
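
Recent pandas releases may emit a FutureWarning when numpy functions such as np.sum are passed to agg. The following sketch, written under that assumption, uses the string aliases "sum" and "mean" alongside the named function from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("A", 10, 3), ("A", 12, 4), ("B", 5, 6), ("B", 9, 3), ("A", 84, 2), ("B", 22, 1)],
    columns=["My_KEY", "MY_COL1", "MY_COL2"],
)

def percentil25(x):
    return np.percentile(x, q=25)

# The result has two-level columns: (source column, aggregation name)
result = df.groupby("My_KEY").agg(["sum", "mean", percentil25])
print(result)
```

Individual cells can then be read with a tuple column key, for example `result.loc["A", ("MY_COL1", "percentil25")]`.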

Answer 10

小智 6

You may also be able to use a lambda to achieve the same thing, as in the code below:

# Each group yields one list: [min, 25th percentile, median, 75th percentile, max]
column.agg(
    lambda x: [
        np.min(a=x),
        np.percentile(q=25, a=x),
        np.median(a=x),
        np.percentile(q=75, a=x),
        np.max(a=x),
    ]
)

Answer 11

Aru*_*pet 5

You can have agg() apply custom functions to a specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})

Archived:	12 years, 6 months ago
Views:	28,580
Last activity:	6 years, 3 months ago