xia*_*hao 1 python dataframe pandas
I have a DataFrame "df" that stores users' orders:
user_id order_date
0 a 2018-01-17
1 a 2018-04-29
2 a 2018-05-19
3 a 2018-05-21
4 a 2018-06-15
5 b 2018-09-18
6 b 2019-01-30
7 b 2019-02-01
8 b 2019-07-03
9 c 2019-07-31
10 c 2019-12-10
11 c 2019-12-12
12 c 2019-12-24
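"order_date" is already sorted. I want the number of days between consecutive orders for each user. I think I need "groupby" to separate the users and then compute the datediff. The result should be: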
user_id datediff
0 a NA
1 a 102
2 a 20
3 a 2
4 a 25
5 b NA
6 b 134
7 b 2
8 b 152
9 c NA
10 c 132
11 c 2
12 c 12
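I know how to do this with a naive loop. Is there a better way, for example with shift or rolling? By the way, the result does not have to match exactly: "NA" can be "NaT", and "102" can be "102 days".
In addition, how do I get the mean datediff for each user? The result should be: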
user_id mean_datediff
0 a 37.25
1 b 96.00
2 c 48.67
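For user a, the mean is (102+20+2+25)/4 = 37.25, not 149/5, because the missing first value is not counted.
The last step is to add "mean_datediff" to the original df. The expected output is: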
user_id order_date mean_datediff
0 a 2018-01-17 37.25
1 a 2018-04-29 37.25
2 a 2018-05-19 37.25
3 a 2018-05-21 37.25
4 a 2018-06-15 37.25
5 b 2018-09-18 96.00
6 b 2019-01-30 96.00
7 b 2019-02-01 96.00
8 b 2019-07-03 96.00
9 c 2019-07-31 48.67
10 c 2019-12-10 48.67
11 c 2019-12-12 48.67
12 c 2019-12-24 48.67
Use DataFrameGroupBy.diff for the difference within each group, and Series.dt.days to convert the timedeltas to days:
import pandas as pd

df['order_date'] = pd.to_datetime(df['order_date'])
df['datediff'] = df.groupby(['user_id'])['order_date'].diff().dt.days
print(df)
user_id order_date datediff
0 a 2018-01-17 NaN
1 a 2018-04-29 102.0
2 a 2018-05-19 20.0
3 a 2018-05-21 2.0
4 a 2018-06-15 25.0
5 b 2018-09-18 NaN
6 b 2019-01-30 134.0
7 b 2019-02-01 2.0
8 b 2019-07-03 152.0
9 c 2019-07-31 NaN
10 c 2019-12-10 132.0
11 c 2019-12-12 2.0
12 c 2019-12-24 12.0
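Since the question mentions shift, here is a minimal sketch of the shift-based equivalent. It assumes the sample data from the question, rebuilt here so that the snippet runs on its own; the variable names prev_order, datediff_shift and datediff_diff are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "user_id": list("aaaaabbbbcccc"),
    "order_date": pd.to_datetime([
        "2018-01-17", "2018-04-29", "2018-05-19", "2018-05-21", "2018-06-15",
        "2018-09-18", "2019-01-30", "2019-02-01", "2019-07-03",
        "2019-07-31", "2019-12-10", "2019-12-12", "2019-12-24",
    ]),
})

# Previous order date within each user; a user's first order gets NaT.
prev_order = df.groupby("user_id")["order_date"].shift()

# Subtracting gives timedeltas; .dt.days turns them into days (NaN where NaT).
datediff_shift = (df["order_date"] - prev_order).dt.days

# The diff-based version from the answer, for comparison.
datediff_diff = df.groupby("user_id")["order_date"].diff().dt.days

print(datediff_shift.equals(datediff_diff))
```

Both variants rely on the rows already being sorted by date within each user, as the question states.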
If needed, convert the result to integers with Series.astype; the nullable Int64 dtype works in pandas 0.24+:
df['order_date'] = pd.to_datetime(df['order_date'])
df['datediff'] = df.groupby(['user_id'])['order_date'].diff().dt.days.astype('Int64')
print(df)
user_id order_date datediff
0 a 2018-01-17 NaN
1 a 2018-04-29 102
2 a 2018-05-19 20
3 a 2018-05-21 2
4 a 2018-06-15 25
5 b 2018-09-18 NaN
6 b 2019-01-30 134
7 b 2019-02-01 2
8 b 2019-07-03 152
9 c 2019-07-31 NaN
10 c 2019-12-10 132
11 c 2019-12-12 2
12 c 2019-12-24 12
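To see why the cast is needed, here is a small sketch on the first three order dates of user a. The names dates, days_float and days_int are illustrative. The missing first value forces a float result from .dt.days, and the capitalized "Int64" dtype keeps the value missing while storing integers. Depending on the pandas version, the missing entry may print as NaN or as <NA>.

```python
import pandas as pd

# First three order dates of user a from the question.
dates = pd.Series(pd.to_datetime(["2018-01-17", "2018-04-29", "2018-05-19"]))

# diff() leaves the first row as NaT, so .dt.days falls back to float64.
days_float = dates.diff().dt.days

# The nullable integer dtype keeps the gap missing but stores whole numbers.
days_int = days_float.astype("Int64")

print(days_float.dtype, days_int.dtype)
print(days_int)
```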
EDIT:
For a new column filled with each user's mean, use GroupBy.transform with a lambda function:
df['mean_datediff'] = (df.groupby(['user_id'])['order_date']
                         .transform(lambda x: x.diff().dt.days.mean()))
print(df)
user_id order_date mean_datediff
0 a 2018-01-17 37.250000
1 a 2018-04-29 37.250000
2 a 2018-05-19 37.250000
3 a 2018-05-21 37.250000
4 a 2018-06-15 37.250000
5 b 2018-09-18 96.000000
6 b 2019-01-30 96.000000
7 b 2019-02-01 96.000000
8 b 2019-07-03 96.000000
9 c 2019-07-31 48.666667
10 c 2019-12-10 48.666667
11 c 2019-12-12 48.666667
12 c 2019-12-24 48.666667
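The question also asks for a one-row-per-user table of means. Here is a possible sketch, assuming the same sample data; mean_by_user is an illustrative name. It also shows a lambda-free way to broadcast the means back onto the rows, by computing "datediff" first and then passing the string "mean" to transform.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "user_id": list("aaaaabbbbcccc"),
    "order_date": pd.to_datetime([
        "2018-01-17", "2018-04-29", "2018-05-19", "2018-05-21", "2018-06-15",
        "2018-09-18", "2019-01-30", "2019-02-01", "2019-07-03",
        "2019-07-31", "2019-12-10", "2019-12-12", "2019-12-24",
    ]),
})

# Gap in days to the previous order of the same user (NaN for the first order).
df["datediff"] = df.groupby("user_id")["order_date"].diff().dt.days

# One row per user; mean() skips NaN, so user a is divided by 4, not 5.
mean_by_user = (
    df.groupby("user_id")["datediff"]
      .mean()
      .rename("mean_datediff")
      .reset_index()
)
print(mean_by_user.round(2))

# Same means repeated on every row of the original frame, without a lambda.
df["mean_datediff"] = df.groupby("user_id")["datediff"].transform("mean")
print(df)
```

For user b this gives (134+2+152)/3 = 96, which matches the transform output above.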