aho*_*osh 12 python time-series aggregation pandas
我有一个这样的Pandas数据框:
test = pd.DataFrame({ 'Date' : ['2016-04-01','2016-04-01','2016-04-02',
'2016-04-02','2016-04-03','2016-04-04',
'2016-04-05','2016-04-06','2016-04-06'],
'User' : ['Mike','John','Mike','John','Mike','Mike',
'Mike','Mike','John'],
'Value' : [1,2,1,3,4.5,1,2,3,6]
})
Run Code Online (Sandbox Code Playgroud)
如下所示,数据集不一定每天都有观察结果:
Date User Value
0 2016-04-01 Mike 1.0
1 2016-04-01 John 2.0
2 2016-04-02 Mike 1.0
3 2016-04-02 John 3.0
4 2016-04-03 Mike 4.5
5 2016-04-04 Mike 1.0
6 2016-04-05 Mike 2.0
7 2016-04-06 Mike 3.0
8 2016-04-06 John 6.0
Run Code Online (Sandbox Code Playgroud)
我想添加一个新列,显示过去n天内每个用户的平均值(在这种情况下n = 2),如果至少有一天可用,否则它会nan有价值.例如,2016-04-06约翰得到一个nan因为他没有2016-04-05和的数据2016-04-04.所以结果将是这样的:
Date User Value Value_Average_Past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
Run Code Online (Sandbox Code Playgroud)
在阅读论坛中的几篇帖子之后,我似乎应该结合group_by并定制rolling_mean,但我无法弄清楚如何去做.
jez*_*ael 11
我觉得你可以先用转换栏Date to_datetime,然后找到丢失Days的groupby与resample和最后apply rolling
test['Date'] = pd.to_datetime(test['Date'])
df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
print df
User Value
User Date
John 2016-04-01 John 2.0
2016-04-02 John 3.0
2016-04-03 NaN NaN
2016-04-04 NaN NaN
2016-04-05 NaN NaN
2016-04-06 John 6.0
Mike 2016-04-01 Mike 1.0
2016-04-02 Mike 1.0
2016-04-03 Mike 4.5
2016-04-04 Mike 1.0
2016-04-05 Mike 2.0
df1 = df.groupby(level=0)['Value']
.apply(lambda x: x.shift().rolling(min_periods=1,window=2).mean())
.reset_index(name='Value_Average_Past_2_days')
Run Code Online (Sandbox Code Playgroud)
print df1
User Date Value_Average_Past_2_days
0 John 2016-04-01 NaN
1 John 2016-04-02 2.00
2 John 2016-04-03 2.50
3 John 2016-04-04 3.00
4 John 2016-04-05 NaN
5 John 2016-04-06 NaN
6 Mike 2016-04-01 NaN
7 Mike 2016-04-02 1.00
8 Mike 2016-04-03 1.00
9 Mike 2016-04-04 2.75
10 Mike 2016-04-05 2.75
11 Mike 2016-04-06 1.50
print pd.merge(test, df1, on=['Date', 'User'], how='left')
Date User Value Value_Average_Past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
6375 次 |
| 最近记录: |