use*_*396 6 python group-by apply dataframe pandas
I'm having a hard time solving a look-back (rolling) problem on a DataFrame or with groupby.
Here is a simple example of the DataFrame I have:
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
…
20141231 apple 3
20141231 grape 2
For each day, I need to compute the average 'amount' of each fruit over the previous 3 days, and create the following DataFrame:
fruit average_in_last 3 days
20140104 apple 4
20140104 orange 10
...
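For example, on 20140104 the previous 3 days are 20140101, 20140102 and 20140103 (note that the dates in the DataFrame are not contiguous and 20140103 does not exist), so the average amount for apple is (3 + 5) / 2 = 4 and for orange it is 10 / 1 = 10; the rest are 0.
The sample DataFrame is very simple, but the real one is more complex and much larger. I hope someone can shed some light on this. Thanks in advance!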
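To make the arithmetic concrete, here is a minimal sketch that rebuilds the eight sample rows and checks the expected values for 20140104 by hand. The column name date and the variable names are assumptions for illustration; the frame in the question keeps the dates in the index.

```python
import pandas as pd

# Sample rows copied from the question; integer dates are parsed into Timestamps.
raw = [
    (20140101, "apple", 3), (20140102, "apple", 5), (20140102, "orange", 10),
    (20140104, "banana", 2), (20140104, "apple", 10), (20140104, "orange", 4),
    (20140105, "orange", 6), (20140105, "grape", 1),
]
df = pd.DataFrame(raw, columns=["date", "fruit", "amount"])
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# Look-back window for 2014-01-04: the three calendar days before it,
# excluding the day itself.
day = pd.Timestamp("2014-01-04")
window = df[(df["date"] >= day - pd.Timedelta(days=3)) & (df["date"] < day)]

# Mean amount per fruit inside that window.
expected = window.groupby("fruit")["amount"].mean()
print(expected)
```

Fruits with no rows in the window (banana, grape) simply do not appear in the result, so reporting them as 0 would need an explicit fill.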
Suppose we start with a DataFrame like this:
>>> df
fruit amount
2017-06-01 apple 1
2017-06-03 apple 16
2017-06-04 apple 12
2017-06-05 apple 8
2017-06-06 apple 14
2017-06-08 apple 1
2017-06-09 apple 4
2017-06-02 orange 13
2017-06-03 orange 9
2017-06-04 orange 9
2017-06-05 orange 2
2017-06-06 orange 11
2017-06-07 orange 6
2017-06-08 orange 3
2017-06-09 orange 3
2017-06-10 orange 13
2017-06-02 grape 14
2017-06-03 grape 16
2017-06-07 grape 4
2017-06-09 grape 15
2017-06-10 grape 5
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]
>>> temp = (df.groupby('fruit')['amount']
...           .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group
...                             .fillna(0)       # treat each missing date as 0
...                             .rolling(3)
...                             .sum())          # rolling sum over 3 days (current day included)
...           .reset_index()
...           .rename(columns={'amount': 'sum_of_3_days',
...                            'level_1': 'date'}))  # rename date index to date col
>>> temp.head()
fruit date sum_of_3_days
0 apple 2017-06-01 NaN
1 apple 2017-06-02 NaN
2 apple 2017-06-03 17.0
3 apple 2017-06-04 28.0
4 apple 2017-06-05 36.0
# converts the date index into date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
>>> df = df.merge(temp, on=['fruit', 'date'])  # merge returns a new frame, so assign it back
>>> df
date fruit amount sum_of_3_days
0 2017-06-01 apple 1 NaN
1 2017-06-03 apple 16 17.0
2 2017-06-04 apple 12 28.0
3 2017-06-05 apple 8 36.0
4 2017-06-06 apple 14 34.0
5 2017-06-08 apple 1 15.0
6 2017-06-09 apple 4 5.0
7 2017-06-02 orange 13 NaN
8 2017-06-03 orange 9 22.0
9 2017-06-04 orange 9 31.0
10 2017-06-05 orange 2 20.0
11 2017-06-06 orange 11 22.0
12 2017-06-07 orange 6 19.0
13 2017-06-08 orange 3 20.0
14 2017-06-09 orange 3 12.0
15 2017-06-10 orange 13 19.0
16 2017-06-02 grape 14 NaN
17 2017-06-03 grape 16 30.0
18 2017-06-07 grape 4 4.0
19 2017-06-09 grape 15 19.0
20 2017-06-10 grape 5 20.0
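Note that the code above produces a rolling sum that includes the current day, while the question asks for the mean over the preceding days only. One possible adaptation, sketched here on the question's sample data, leaves the missing days as NaN (so they do not pull the mean toward 0), uses min_periods=1, and shifts the result by one day. The helper name prev_3_day_mean and the date column are assumptions, not part of the original answer.

```python
import pandas as pd

raw = [
    (20140101, "apple", 3), (20140102, "apple", 5), (20140102, "orange", 10),
    (20140104, "banana", 2), (20140104, "apple", 10), (20140104, "orange", 4),
    (20140105, "orange", 6), (20140105, "grape", 1),
]
df = pd.DataFrame(raw, columns=["date", "fruit", "amount"])
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")


def prev_3_day_mean(s: pd.Series) -> pd.Series:
    """s holds one fruit's amounts, indexed by unique, sorted dates."""
    full = pd.date_range(s.index.min(), s.index.max(), freq="D")
    daily = s.reindex(full)  # missing days become NaN rather than 0
    # Mean over each 3-day window, then shift by one day so that the
    # value at day t only covers t-3 .. t-1.
    avg = daily.rolling(3, min_periods=1).mean().shift(1)
    return avg.reindex(s.index)  # keep only the dates that were observed


# Apply the helper fruit by fruit and stitch the pieces back together.
parts = {
    fruit: prev_3_day_mean(g.set_index("date")["amount"])
    for fruit, g in df.groupby("fruit")
}
result = pd.concat(parts, names=["fruit", "date"]).rename("avg_prev_3_days")
print(result)
```

The explicit loop over groups is slower than a single groupby call but avoids depending on how a given pandas version builds the index after groupby.apply.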
小智 5
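I also wanted to use rolling with groupby, which is how I landed on this page, but I think I have a better workaround than the earlier suggestions.
You can do the following: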
pivoted_df = pd.pivot_table(df, index='date', columns='fruit', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack()
The .stack() is not required, but it converts the pivot table back into a regular df.
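Two caveats apply to a plain window=3 on the pivoted table: it counts rows rather than calendar days, and any NaN inside the window makes the result NaN. A possible variant, assuming the dates are parsed into a datetime column called date, uses a time-based window instead; closed='left' keeps the current day out of the average. This is a sketch on the question's sample data, not the answer author's code.

```python
import pandas as pd

raw = [
    (20140101, "apple", 3), (20140102, "apple", 5), (20140102, "orange", 10),
    (20140104, "banana", 2), (20140104, "apple", 10), (20140104, "orange", 4),
    (20140105, "orange", 6), (20140105, "grape", 1),
]
df = pd.DataFrame(raw, columns=["date", "fruit", "amount"])
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# One column per fruit, one row per observed date (NaN where a fruit is absent).
pivoted = df.pivot_table(index="date", columns="fruit", values="amount")

# "3D" with closed="left" covers [t - 3 days, t): the three calendar days
# before t. NaN cells are skipped; an empty window yields NaN.
avg = pivoted.rolling("3D", closed="left", min_periods=1).mean()

# Back to long form; drop (date, fruit) pairs that have no history in the window.
tidy = avg.stack().dropna().rename("avg_prev_3_days")
print(tidy)
```

The time-based window relies on the pivoted index being a sorted DatetimeIndex, which pivot_table produces here because it sorts the index values.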
You can do it like this:
>>> df
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
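>>> g = df.set_index('fruit', append=True).groupby(level=1)
>>> # pd.rolling_mean has been removed from pandas; .rolling(...).mean() replaces it
>>> res = g['amount'].transform(lambda s: s.rolling(3, min_periods=1).mean()).reset_index('fruit')
>>> res
fruit amount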
20140101 apple 3.000000
20140102 apple 4.000000
20140102 orange 10.000000
20140104 banana 2.000000
20140104 apple 6.000000
20140104 orange 7.000000
20140105 orange 6.666667
20140105 grape 1.000000
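Update
OK, as @cphlewis mentioned in the comments, my code doesn't give the result you want. I looked at different approaches, and the one I have found so far is this (I'm not sure about its performance, though):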
>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)
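>>> def avg_3_days(x):
...     # mean amount of the same fruit over the 3 days before x's date (current day excluded)
...     mask = ((df['index'] >= x['index'] - pd.DateOffset(days=3))
...             & (df['index'] < x['index'])
...             & (df['fruit'] == x['fruit']))
...     return df.loc[mask, 'amount'].mean()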
>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df
index fruit amount res
0 2014-01-01 apple 3 NaN
1 2014-01-02 apple 5 3
2 2014-01-02 orange 10 NaN
3 2014-01-04 banana 2 NaN
4 2014-01-04 apple 10 4
5 2014-01-04 orange 4 10
6 2014-01-05 orange 6 7
7 2014-01-05 grape 1 NaN
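If the row-wise apply turns out to be slow on a large frame, a vectorized alternative is a grouped, time-based rolling window. This is a different technique from the one above, sketched under the assumption that the dates are stored as datetimes in a column called date (the code above calls that column index).

```python
import pandas as pd

raw = [
    (20140101, "apple", 3), (20140102, "apple", 5), (20140102, "orange", 10),
    (20140104, "banana", 2), (20140104, "apple", 10), (20140104, "orange", 4),
    (20140105, "orange", 6), (20140105, "grape", 1),
]
df = pd.DataFrame(raw, columns=["date", "fruit", "amount"])
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# Per fruit, average the rows whose date falls in [t - 3 days, t).
# The frame must be sorted by date for a time-based window.
avg = (
    df.sort_values("date")
      .set_index("date")
      .groupby("fruit")["amount"]
      .rolling("3D", closed="left")
      .mean()
)

# Attach the result to the original rows, keeping their order.
out = df.merge(avg.rename("res").reset_index(), on=["fruit", "date"], how="left")
print(out)
```

On the sample data this yields the same res column as the row-wise version, with NaN wherever a fruit has no rows in the preceding three days.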