mik*_*010 30 python numpy pandas
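I have the following table. I want to calculate a weighted average grouped by each date, based on the formula below. I can do this with some standard conventional code, but given that this data is in a pandas DataFrame, is there an easier way to achieve this than iterating?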
Date        ID   wt    value  w_avg
01/01/2012  100  0.50  60     0.791666667
01/01/2012  101  0.75  80
01/01/2012  102  1.00  100
01/02/2012  201  0.50  100    0.722222222
01/02/2012  202  1.00  80
01/01/2012 w_avg = 0.5*(60/sum(60,80,100))+ .75*(80/sum(60,80,100))+ 1.0*(100/sum(60,80,100))
01/02/2012 w_avg = 0.5*(100/sum(100,80))+ 1.0*(80/sum(100,80))
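Both formulas reduce to sum(wt * value) / sum(value) within a date, that is, the mean of `wt` weighted by `value`. A minimal plain-Python sketch that checks the expected numbers (the helper name `weighted_avg` is made up for illustration and is not part of the question):

```python
def weighted_avg(wts, values):
    """Mean of `wts` weighted by `values`: sum(wt * value) / sum(value)."""
    total = sum(values)
    return sum(w * v for w, v in zip(wts, values)) / total


# The two date groups from the table above.
print(round(weighted_avg([0.5, 0.75, 1.0], [60, 80, 100]), 6))  # 0.791667
print(round(weighted_avg([0.5, 1.0], [100, 80]), 6))            # 0.722222
```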
And*_*den 23
I think I would do this with two groupbys.
First, calculate each row's term of the "weighted average":
In [11]: g = df.groupby('Date')
In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0 0.125000
1 0.250000
2 0.416667
3 0.277778
4 0.444444
dtype: float64
If you set this as a column, you can group by it:
In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt
Now the sum of this column is the desired result:
In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
Name: wa, dtype: float64
Or potentially, to broadcast the result back onto every row:
In [15]: g.wa.transform("sum")
Out[15]:
0 0.791667
1 0.791667
2 0.791667
3 0.722222
4 0.722222
Name: wa, dtype: float64
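Putting the steps of this answer together, a self-contained sketch might look like the following. It assumes `Date` is an ordinary column (the answer does not show how `df` was built), and the names `contrib` and `w_avg` are illustrative rather than taken from the answer.

```python
import pandas as pd

# The question's table, with Date assumed to be a regular column.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.5, 0.75, 1.0, 0.5, 1.0],
    "value": [60, 80, 100, 100, 80],
})

# Step 1: each row's term, wt * value / (sum of value within its date).
contrib = df["wt"] * df["value"] / df.groupby("Date")["value"].transform("sum")

# Step 2: sum the terms per date to get one weighted average per group ...
w_avg = contrib.groupby(df["Date"]).sum()

# ... or broadcast the group result back onto every row.
df["w_avg"] = contrib.groupby(df["Date"]).transform("sum")

print(w_avg)
```

Using `transform` in step 1 keeps the result aligned with the original rows, which is what makes the element-wise multiplication possible without any explicit loop.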
kad*_*dee 19
Let's first create the example pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')
In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)
Then the average of 'wt', weighted by 'value' and grouped by the index, is obtained as follows:
In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Alternatively, you can define a function:
In [6]: def grouped_weighted_avg(values, weights, by):
   ...:     return (values * weights).groupby(by).sum() / weights.groupby(by).sum()
In [7]: grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
Out[7]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Ban*_*ana 11
If speed matters to you, vectorization is essential. So, based on Andy Hayden's answer, here is a solution that uses only native pandas functions:
def weighted_mean(df, values, weights, groupby):
    df = df.copy()  # avoid mutating the caller's DataFrame
    grouped = df.groupby(groupby)
    # Each row's term: value * weight / (sum of weights within its group)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1)  # min_count is required for Grouper objects
By comparison, using a custom lambda function takes less code but is slower:
import numpy as np

def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))
Speed test:
import time
import numpy as np
import pandas as pd
n = 100000000
df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n),
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n),
})
time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)
time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)
Speed test output:
Time for `weighted_mean`: 3.4519572257995605
Time for `weighted_mean_by_lambda`: 11.41335940361023
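One case the benchmark does not exercise: since `weights` is drawn from `randint(0, 5)`, a group could in principle contain only zero weights. As far as I can tell the two styles behave differently there: `np.average` raises `ZeroDivisionError` when the weights sum to zero, whereas a vectorized ratio of sums yields `NaN`. A small sketch with made-up data (the helper `weighted_mean_ratio` is a hypothetical simplification, not the function from this answer):

```python
import numpy as np
import pandas as pd

# Group "b" has only zero weights.
df = pd.DataFrame({
    "values": [1.0, 2.0, 3.0, 4.0],
    "weights": [1, 3, 0, 0],
    "groupby": ["a", "a", "b", "b"],
})


def weighted_mean_ratio(df, values, weights, groupby):
    """Vectorized weighted mean: sum(value * weight) / sum(weight) per group."""
    num = (df[values] * df[weights]).groupby(df[groupby]).sum()
    den = df[weights].groupby(df[groupby]).sum()
    return num / den  # 0.0 / 0 becomes NaN instead of raising


result = weighted_mean_ratio(df, "values", "weights", "groupby")
print(result)  # a -> 1.75, b -> NaN

# np.average refuses to normalize weights that sum to zero.
try:
    np.average([3.0, 4.0], weights=[0, 0])
    raised = False
except ZeroDivisionError:
    raised = True
print("np.average raised ZeroDivisionError:", raised)
```

Which behavior is preferable depends on whether an all-zero group should be treated as missing data or as an error.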
小智 6
I think the following is an elegant solution to this problem, taken from: (Pandas DataFrame aggregate function using multiple columns)
grouped = df.groupby('Date')

def wavg(group):
    d = group['value']
    w = group['wt']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
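Note that `wavg` as written averages `value` weighted by `wt`, whereas the formula in the question averages `wt` weighted by `value`, so the two give different numbers. A sketch comparing them on the question's data (assuming `Date` is a regular column; `wavg_question` is a hypothetical name for the swapped version):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.5, 0.75, 1.0, 0.5, 1.0],
    "value": [60, 80, 100, 100, 80],
})


def wavg(group):
    # Mean of `value`, weighted by `wt` (as in the answer).
    d = group["value"]
    w = group["wt"]
    return (d * w).sum() / w.sum()


def wavg_question(group):
    # Mean of `wt`, weighted by `value` (as in the question's formula).
    return (group["wt"] * group["value"]).sum() / group["value"].sum()


# Select only the columns the functions need before applying.
cols = df.groupby("Date")[["wt", "value"]]
by_wt = cols.apply(wavg)
by_value = cols.apply(wavg_question)

print(by_wt)     # roughly 84.44 and 86.67
print(by_value)  # roughly 0.791667 and 0.722222
```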
I saved the table in a .csv file:
import numpy as np
import pandas as pd

df = pd.read_csv('book1.csv')
grouped = df.groupby('Date')
# Average of wt, weighted by value, within each date
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)
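The contents of `book1.csv` are not shown. Assuming it holds the question's table with a header row and comma separators, the same steps can be tried without a file by reading from an in-memory string (`csv_text` is a stand-in for the file):

```python
import io

import numpy as np
import pandas as pd

# Stand-in for book1.csv; the real file layout is an assumption.
csv_text = """Date,ID,wt,value
01/01/2012,100,0.50,60
01/01/2012,101,0.75,80
01/01/2012,102,1.00,100
01/02/2012,201,0.50,100
01/02/2012,202,1.00,80
"""

df = pd.read_csv(io.StringIO(csv_text))

# Average of wt, weighted by value, for each date.
result = df.groupby("Date")[["wt", "value"]].apply(
    lambda g: np.average(g["wt"], weights=g["value"])
)
print(result)
```

Dates are left as strings here, since `read_csv` does not parse them unless asked to; that is enough for grouping.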