How to remove duplicates and keep the last timestamp in pandas

Question

Nab*_*zir 5 python timestamp dataframe pandas

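I want to drop duplicates and keep the row with the last timestamp. Rows count as duplicates when they share the same customer_id and var_name. This is my data: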

    customer_id  value   var_name     timestamp
    1            1       apple        2018-03-22 00:00:00.000        
    2            3       apple        2018-03-23 08:00:00.000
    2            4       apple        2018-03-24 08:00:00.000
    1            1       orange       2018-03-22 08:00:00.000
    2            3       orange       2018-03-24 08:00:00.000
    2            5       orange       2018-03-23 08:00:00.000

The result should be:

    customer_id  value   var_name     timestamp
    1            1       apple        2018-03-22 00:00:00.000        
    2            4       apple        2018-03-24 08:00:00.000
    1            1       orange       2018-03-22 08:00:00.000
    2            3       orange       2018-03-24 08:00:00.000

Answer 1

jez*_*ael 7

I think you need sort_values with drop_duplicates:

    df = df.sort_values('timestamp').drop_duplicates(['customer_id','var_name'], keep='last')
    print(df)
       customer_id  value var_name                timestamp
    0            1      1    apple  2018-03-22 00:00:00.000
    3            1      1   orange  2018-03-22 08:00:00.000
    2            2      4    apple  2018-03-24 08:00:00.000
    4            2      3   orange  2018-03-24 08:00:00.000
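
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the question's sample data and applies the same sort_values plus drop_duplicates idea. The variable name `latest`, the explicit `pd.to_datetime` conversion, the `kind="mergesort"` argument (a stable sort, so tied timestamps keep their original relative order), and the final `sort_index()` (to restore the original row order) are my own additions, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ],
})

# Assumption: convert to real datetimes so the sort is chronological
# rather than a string comparison.
df["timestamp"] = pd.to_datetime(df["timestamp"])

latest = (
    df.sort_values("timestamp", kind="mergesort")  # stable sort
      .drop_duplicates(["customer_id", "var_name"], keep="last")
      .sort_index()  # optional: back to the original row order
)
print(latest)
```

Dropping the `sort_index()` call leaves the rows ordered by timestamp, as in the output shown above.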

If you do not want to sort because the original order matters:

    # idxmax needs a real datetime column, so convert the strings first
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.loc[df.groupby(['customer_id','var_name'], sort=False)['timestamp'].idxmax()]
    print(df)
       customer_id  value var_name           timestamp
    0            1      1    apple 2018-03-22 00:00:00
    2            2      4    apple 2018-03-24 08:00:00
    3            1      1   orange 2018-03-22 08:00:00
    4            2      3   orange 2018-03-24 08:00:00
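
The groupby plus idxmax variant can also be wrapped in a small function. This is only a sketch: `keep_latest` and its parameter names are hypothetical, and it assumes the DataFrame index is unique (otherwise `.loc` may return extra rows) and that the timestamp column has already been converted to datetimes.

```python
import pandas as pd


def keep_latest(frame, keys, ts_col):
    """Hypothetical helper: one row per key combination, the one with the
    largest value in ts_col. Assumes a unique index and a datetime column."""
    # idxmax returns, for each group, the index label of the max timestamp.
    # sort=False keeps the groups in order of first appearance.
    labels = frame.groupby(keys, sort=False)[ts_col].idxmax()
    return frame.loc[labels]


# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": pd.to_datetime([
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ]),
})

result = keep_latest(df, ["customer_id", "var_name"], "timestamp")
print(result)
```

One difference worth knowing: when two rows in a group share the same maximum timestamp, idxmax returns the first of them, while a stable sort followed by drop_duplicates(keep='last') keeps the last.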
