ako*_*ako 3 python group-by weighted-average pandas
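I have a dataset in which each observation carries a weight, and I want to produce weighted summaries with groupby, but I am rusty on how best to do this. I think it calls for a custom aggregation function. My difficulty is handling group-wise data rather than item-wise data. Perhaps that means it is best done in steps rather than in one go.
In pseudocode, I am looking for: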
# first, calculate the weighted value
for each row:
    weighted jobs = weight * jobs
# then, for each city, sum the weighted values and divide by the sum of weights
for each city:
    sum(weighted jobs) / sum(weight)
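I am not sure how to handle the "for each city" part: how to split it out into a custom aggregation function and access group-level summaries.
Mock data: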
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})
Just multiply the two columns:
In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']
Now you can group by city and take the sum of each numeric column:
In [12]: df_city_sums = df_city.groupby('city')[['jobs', 'weight', 'weighted_jobs']].sum()
In [13]: df_city_sums
Out[13]:
           jobs  weight  weighted_jobs
city
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]
Now divide the sum of weighted jobs by the sum of weights, as in the question's pseudocode, to get the desired result:
In [14]: df_city_sums['weighted_jobs'] / df_city_sums['weight']
Out[14]:
city
oakland      11.533333
san mateo     8.875123
sf            9.731975
dtype: float64
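If you would rather do it in one go, as the question asks, a custom per-group function is one option. The following is only a minimal sketch: the helper name weighted_mean and the tiny four-row frame are made up for illustration (they are not the mock data above), chosen so the result can be checked by hand. It relies on np.average with its weights argument and compares the result against the step-by-step approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the mock data from the question),
# small enough to verify the weighted means by hand.
df = pd.DataFrame({
    "city": ["sf", "sf", "oakland", "oakland"],
    "weight": [1, 3, 2, 2],
    "jobs": [10, 20, 5, 15],
})

def weighted_mean(group):
    """Hypothetical helper: sum(jobs * weight) / sum(weight) for one group."""
    return np.average(group["jobs"], weights=group["weight"])

# One step: apply the helper to each city's rows.
# Selecting the two columns keeps the grouping column out of the applied frame.
one_step = df.groupby("city")[["jobs", "weight"]].apply(weighted_mean)

# Step by step, for comparison: multiply, sum per city, then divide.
sums = (
    df.assign(weighted_jobs=df["weight"] * df["jobs"])
      .groupby("city")[["weight", "weighted_jobs"]]
      .sum()
)
two_step = sums["weighted_jobs"] / sums["weight"]

print(one_step)
print(two_step)
```

Both give 17.5 for sf, (10*1 + 20*3) / 4, and 10.0 for oakland, (5*2 + 15*2) / 4. The step-by-step version is usually faster on large frames because it avoids calling a Python function once per group.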