Calculate weighted average using a pandas/dataframe

mik*_*010 30 python numpy pandas

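I have the following table. I want to calculate a weighted average grouped by each date, based on the formula below. I can do this using some standard conventional code, but assuming this data is in a pandas dataframe, is there an easier way to achieve this than through iteration?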

Date        ID      wt      value   w_avg
01/01/2012  100     0.50    60      0.791666667
01/01/2012  101     0.75    80
01/01/2012  102     1.00    100
01/02/2012  201     0.50    100     0.722222222
01/02/2012  202     1.00    80

01/01/2012 w_avg = 0.5*(60/sum(60,80,100))+ .75*(80/sum(60,80,100))+ 1.0*(100/sum(60,80,100))

01/02/2012 w_avg = 0.5*(100/sum(100,80))+ 1.0*(80/sum(100,80))
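For reference, the two formulas above can be checked with plain Python arithmetic. This is only a minimal sketch of the calculation, not code from the original post; the `groups` and `w_avg` names are made up for illustration, and the numbers are copied from the table.

```python
# Values and weights copied from the table in the question.
groups = {
    "01/01/2012": {"wt": [0.50, 0.75, 1.00], "value": [60, 80, 100]},
    "01/02/2012": {"wt": [0.50, 1.00], "value": [100, 80]},
}

w_avg = {}
for date, cols in groups.items():
    total = sum(cols["value"])
    # Each wt is scaled by that row's share of the group's total value.
    w_avg[date] = sum(w * v / total for w, v in zip(cols["wt"], cols["value"]))

for date, result in w_avg.items():
    print(date, round(result, 6))
```

This reproduces the w_avg column of the table (0.791667 and 0.722222 after rounding), so the quantity asked for is the average of wt, weighted by value.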

And*_*den 23

I think I would do this with two groupbys.

First, calculate the "weighted average" contribution of each row:

In [11]: g = df.groupby('Date')

In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0    0.125000
1    0.250000
2    0.416667
3    0.277778
4    0.444444
dtype: float64

If you set this as a column, you can group by over it:

In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt

Now the sum of this column within each group is the desired result:

In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012    0.791667
01/02/2012    0.722222
Name: wa, dtype: float64

Or potentially, to broadcast the result back to every row:

In [15]: g.wa.transform("sum")
Out[15]:
0    0.791667
1    0.791667
2    0.791667
3    0.722222
4    0.722222
Name: wa, dtype: float64
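The snippets above assume a `df` that already exists. A self-contained sketch of the same steps might look like the following; the DataFrame construction (with Date as an ordinary column) is an assumption based on the table in the question, not part of the original answer.

```python
import pandas as pd

# Assumed setup: the question's table with Date as a regular column.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

# Per-row contribution: wt times the row's share of its group's total value.
df["wa"] = df["wt"] * df["value"] / df.groupby("Date")["value"].transform("sum")

# Summing the contributions per date gives one weighted average per group.
w_avg = df.groupby("Date")["wa"].sum()
print(w_avg)
```

Here the groupby is recomputed after the `wa` column is added, which avoids relying on a groupby object created before the column existed.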


kad*_*dee 19

Let us first create the example pandas dataframe:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')

In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)

Then, the average of 'wt' weighted by 'value' and grouped by the index is obtained as:

In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]: 
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64

Alternatively, one can also define a function:

In [6]: def grouped_weighted_avg(values, weights, by):
   ...:     return (values * weights).groupby(by).sum() / weights.groupby(by).sum()

In [7]: grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
Out[7]: 
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64

  • Is it possible that in this line: In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value)) the x.wt and x.value should be switched? (2 upvotes)
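Regarding the comment above: judging by the expected numbers in the question, the order used in the answer appears to be the intended one. A minimal sketch comparing both orderings on the question's table follows; the DataFrame construction and the `comparison` column names are my own illustration, not from the answer.

```python
import numpy as np
import pandas as pd

# Assumed setup: the question's table with Date as a regular column.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

rows = []
for date, group in df.groupby("Date"):
    rows.append({
        "Date": date,
        # Ordering used in the answer: average of wt, weighted by value.
        "wt_weighted_by_value": np.average(group["wt"], weights=group["value"]),
        # Switched ordering: average of value, weighted by wt.
        "value_weighted_by_wt": np.average(group["value"], weights=group["wt"]),
    })

comparison = pd.DataFrame(rows).set_index("Date")
print(comparison)
```

The first column reproduces the question's 0.791667 and 0.722222; the switched version gives roughly 84.44 and 86.67, which is the more conventional "value weighted by wt" average but not what the question's formula computes.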

Ban*_*ana 11

If speed is an important factor for you, vectorization is critical. Therefore, based on Andy Hayden's answer, here is a solution using only Pandas native functions:

def weighted_mean(df, values, weights, groupby):
    df = df.copy()
    grouped = df.groupby(groupby)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1) #min_count is required for Grouper objects

In comparison, using a custom lambda function requires less code, but is slower:

import numpy as np
def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))

Speed test:

import time
import numpy as np
import pandas as pd

n = 100000000

df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n), 
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n), 
})

time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)

time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)

Speed test output:

Time for `weighted_mean`: 3.4519572257995605
Time for `weighted_mean_by_lambda`: 11.41335940361023
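With n = 100000000 the benchmark above needs on the order of gigabytes of RAM (three columns of 8-byte values, assuming 64-bit dtypes). A scaled-down sketch that also checks the two approaches return the same numbers could look like this. Both functions are restated slightly differently from the versions above, and the smaller sizes, the seed, and the 1 to 4 weight range (so that no group has zero total weight) are my assumptions.

```python
import timeit

import numpy as np
import pandas as pd


def weighted_mean(df, values, weights, groupby):
    # Vectorized: sum(v * w) / sum(w) per group, with no Python-level apply.
    group_weight = df.groupby(groupby)[weights].transform("sum")
    contrib = df[values] * df[weights] / group_weight
    return contrib.groupby(df[groupby]).sum(min_count=1)


def weighted_mean_by_lambda(df, values, weights, groupby):
    # Selecting the two columns first keeps the grouping column out of apply.
    return df.groupby(groupby)[[values, weights]].apply(
        lambda x: np.average(x[values], weights=x[weights])
    )


rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "values": rng.uniform(0, 1, size=n),
    "weights": rng.integers(1, 5, size=n),   # 1..4, never zero
    "groupby": rng.integers(0, 100, size=n),
})

fast = weighted_mean(df, "values", "weights", "groupby")
slow = weighted_mean_by_lambda(df, "values", "weights", "groupby")
agree = np.allclose(fast.to_numpy(), slow.to_numpy())
print("results agree:", agree)

for fn in (weighted_mean, weighted_mean_by_lambda):
    seconds = timeit.timeit(lambda: fn(df, "values", "weights", "groupby"), number=3)
    print(fn.__name__, seconds / 3)
```

Absolute timings will differ from the figures above and depend on the machine, the pandas version, and the number of groups.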


小智 6

I feel the following is an elegant solution to this problem, taken from: (Pandas DataFrame aggregate function using multiple columns)

grouped = df.groupby('Date')

def wavg(group):
    d = group['value']
    w = group['wt']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
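Note that `wavg` as posted averages 'value' using 'wt' as the weights, so on the question's table it does not return the w_avg figures from the question. One possible generalization, sketched here with a hypothetical `make_wavg` helper (not part of the original answer) and a DataFrame rebuilt from the question's table, makes the two column roles explicit:

```python
import pandas as pd

# Assumed setup: the question's table with Date as a regular column.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})


def make_wavg(data_col, weight_col):
    # Hypothetical helper: builds a per-group function that computes
    # sum(data * weight) / sum(weight) for the chosen columns.
    def wavg(group):
        d = group[data_col]
        w = group[weight_col]
        return (d * w).sum() / w.sum()
    return wavg


grouped = df.groupby("Date")[["wt", "value"]]
as_posted = grouped.apply(make_wavg("value", "wt"))   # value weighted by wt
as_asked = grouped.apply(make_wavg("wt", "value"))    # wt weighted by value
print(as_posted)
print(as_asked)
```

With the columns in the `as_asked` order the result matches the question's 0.791667 and 0.722222.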


use*_*990 6

I saved the table in a .csv file:

import numpy as np
import pandas as pd

df = pd.read_csv('book1.csv')

grouped = df.groupby('Date')
# Average of wt, weighted by value, for each date
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)
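The contents of book1.csv are not shown, so the following sketch substitutes an in-memory CSV built from the question's table. The comma-separated layout and the header names are assumptions; only the approach mirrors the snippet above.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for book1.csv (assumed layout: comma-separated, header row).
csv_text = """Date,ID,wt,value
01/01/2012,100,0.50,60
01/01/2012,101,0.75,80
01/01/2012,102,1.00,100
01/02/2012,201,0.50,100
01/02/2012,202,1.00,80
"""

df = pd.read_csv(io.StringIO(csv_text))

# Average of wt, weighted by value, computed separately for each date.
result = df.groupby("Date")[["wt", "value"]].apply(
    lambda x: np.average(x["wt"], weights=x["value"])
)
print(result)
```

Dates are left as plain strings here, since `parse_dates` is not passed to `read_csv`.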