Nab*_*zir 5 python timestamp dataframe pandas
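I want to drop duplicates and keep the row with the last timestamp. Rows count as duplicates when they share the same customer_id and var_name. This is my data: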
customer_id  value  var_name  timestamp
1            1      apple     2018-03-22 00:00:00.000
2            3      apple     2018-03-23 08:00:00.000
2            4      apple     2018-03-24 08:00:00.000
1            1      orange    2018-03-22 08:00:00.000
2            3      orange    2018-03-24 08:00:00.000
2            5      orange    2018-03-23 08:00:00.000
The result would be:
customer_id  value  var_name  timestamp
1            1      apple     2018-03-22 00:00:00.000
2            4      apple     2018-03-24 08:00:00.000
1            1      orange    2018-03-22 08:00:00.000
2            3      orange    2018-03-24 08:00:00.000
I think you need sort_values with drop_duplicates:
df = df.sort_values('timestamp').drop_duplicates(['customer_id','var_name'], keep='last')
print(df)
   customer_id  value var_name                timestamp
0            1      1    apple  2018-03-22 00:00:00.000
3            1      1   orange  2018-03-22 08:00:00.000
2            2      4    apple  2018-03-24 08:00:00.000
4            2      3   orange  2018-03-24 08:00:00.000
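For anyone who wants to reproduce this, here is a self-contained sketch of the sort_values plus drop_duplicates approach on the sample data from the question. The DataFrame construction, the pd.to_datetime conversion, the names latest and latest_original_order, and the kind="mergesort" argument are my own assumptions and not part of the code above. The stable sort is only there so that rows with equal timestamps keep their original relative order.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (an assumption about dtypes).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ],
})

# Convert to real datetimes so sorting is chronological, not lexical.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Sort oldest to newest (stable sort), then keep the last row
# of every (customer_id, var_name) pair, i.e. the newest one.
latest = (
    df.sort_values("timestamp", kind="mergesort")
      .drop_duplicates(["customer_id", "var_name"], keep="last")
)

# Optional: put the surviving rows back into their original order.
latest_original_order = latest.sort_index()

print(latest)
print(latest_original_order)
```

If the original row order matters, the sort_index() call at the end restores it without changing which rows were kept.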
If you don't want to sort because the original order is important:
df = df.loc[df.groupby(['customer_id','var_name'], sort=False)['timestamp'].idxmax()]
print(df)
   customer_id  value var_name           timestamp
0            1      1    apple 2018-03-22 00:00:00
2            2      4    apple 2018-03-24 08:00:00
3            1      1   orange 2018-03-22 08:00:00
4            2      3   orange 2018-03-24 08:00:00
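A runnable sketch of the groupby plus idxmax variant on the same sample data follows. It rests on a few assumptions of mine that the answer does not state: the timestamp column is converted with pd.to_datetime first (idxmax needs a dtype it can compare, and the output above without milliseconds suggests the column was already datetime), the index is unique so that each label selects exactly one row, and the names idx and latest are only illustrative. Also note that when two rows in a group share the maximum timestamp, idxmax returns the first of them, whereas sorting with keep='last' keeps the last.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (an assumption about dtypes).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# For each (customer_id, var_name) group, find the index label of the row
# with the newest timestamp. sort=False keeps groups in first-seen order.
idx = df.groupby(["customer_id", "var_name"], sort=False)["timestamp"].idxmax()

# Select those rows by label; no sorting of the frame is involved.
latest = df.loc[idx]

print(latest)
```

Because the frame itself is never reordered, the rows come out in the order in which each group first appears in the data.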