Flag duplicates based on the time difference between consecutive rows

P.J*_*P.J 3 python dataframe pandas

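A bank DataFrame (DF) contains duplicate transactions. ID is the customer ID. A duplicate transaction is a multi-swipe: the vendor accidentally charges the customer's card several times within a short time span (2 minutes here).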

import pandas as pd

DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222', '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50', '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50', '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53', '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
# Parse the timestamp strings so that time differences can be computed
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])

    ID  Dollar  transactionDateTime
0   111     1   2016-01-08 19:04:50
1   111     3   2016-01-29 19:03:55
2   111     1   2016-01-08 19:05:50
3   111     10  2016-01-08 20:08:50
4   222     25  2016-01-08 19:04:50
5   222     8   2016-02-08 19:04:50
6   222     25  2016-03-08 19:04:50
7   333     9   2016-01-08 19:04:50
8   333     20  2016-03-08 19:05:53
9   333     9   2016-01-08 19:03:20
10  333     9   2016-01-08 19:02:15
11  111     10  2016-02-08 20:08:50

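I want to add a column to my DataFrame that identifies the duplicate transactions (for the same customer ID, the amount must be the same and the transaction times must be less than 2 minutes apart). Please treat the first transaction as "normal".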

    ID  Dollar  transactionDateTime     Duplicated?
0   111     1   2016-01-08 19:04:50     No
1   111     3   2016-01-29 19:03:55     No
2   111     1   2016-01-08 19:05:50     Yes
3   111     10  2016-01-08 20:08:50     No
4   222     25  2016-01-08 19:04:50     No
5   222     8   2016-02-08 19:04:50     No
6   222     25  2016-03-08 19:04:50     No
7   333     9   2016-01-08 19:04:50     Yes
8   333     20  2016-03-08 19:05:53     No
9   333     9   2016-01-08 19:03:20     Yes
10  333     9   2016-01-08 19:02:15     No
11  111     10  2016-02-08 20:08:50     No

cs9*_*s95 5

IIUC, you can use groupby and diff to check whether the difference between consecutive transactions is less than 120 seconds:

# Uses the DataFrame from the question, which is named DF there
DF['Duplicated?'] = (DF.sort_values(['transactionDateTime'])
                       .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                       .diff()
                       .dt.total_seconds()
                       .lt(120))
DF

     ID  Dollar transactionDateTime  Duplicated?
0   111       1 2016-01-08 19:04:50        False
1   111       3 2016-01-29 19:03:55        False
2   111       1 2016-01-08 19:05:50         True
3   111      10 2016-01-08 20:08:50        False
4   222      25 2016-01-08 19:04:50        False
5   222       8 2016-02-08 19:04:50        False
6   222      25 2016-03-08 19:04:50        False
7   333       9 2016-01-08 19:04:50         True
8   333      20 2016-03-08 19:05:53        False
9   333       9 2016-01-08 19:03:20         True
10  333       9 2016-01-08 19:02:15        False
11  111      10 2016-02-08 20:08:50        False
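The question asks for Yes/No labels rather than booleans. As a minimal sketch, assuming the question's data, the boolean result can be mapped with Series.map. The names gap_seconds and is_dup are illustrative and not part of the answer above. The first transaction in each (ID, Dollar) group has no predecessor, so its diff is NaT, the comparison yields False, and it ends up labelled "No", which matches the request to treat the first transaction as normal.

```python
import pandas as pd

DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222',
           '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': [
        '2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50',
        '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50',
        '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53',
        '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50'],
})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])

# Seconds since the previous transaction with the same ID and amount.
# The first row of every group has no predecessor, so its gap is NaN.
gap_seconds = (DF.sort_values('transactionDateTime')
                 .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                 .diff()
                 .dt.total_seconds())

# NaN < 120 evaluates to False, so first transactions are never flagged.
is_dup = gap_seconds.lt(120)

# Assignment aligns on the index, so the sorted order does not matter here.
DF['Duplicated?'] = is_dup.map({True: 'Yes', False: 'No'})
print(DF)
```

Because the assignment aligns on the index, the rows of DF keep their original order even though the gaps were computed on a sorted copy.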

Note that your data is not sorted, so you have to sort it first to get meaningful results.
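To illustrate why the sort matters, here is a small sketch using only the three $9 transactions of customer 333 from the question, with their original row labels. The helper flag and the frame sub are hypothetical names introduced for this example. On unsorted rows the diffs can be negative, and a negative gap also passes the "less than 120" test, so the wrong rows get flagged.

```python
import pandas as pd

# Customer 333's three $9 transactions, in the question's (unsorted) order
sub = pd.DataFrame({
    'ID': ['333', '333', '333'],
    'Dollar': [9, 9, 9],
    'transactionDateTime': ['2016-01-08 19:04:50',
                            '2016-01-08 19:03:20',
                            '2016-01-08 19:02:15'],
}, index=[7, 9, 10])
sub['transactionDateTime'] = pd.to_datetime(sub['transactionDateTime'])


def flag(frame):
    """Flag rows that follow the previous row of their group by < 120 s."""
    return (frame.groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                 .diff()
                 .dt.total_seconds()
                 .lt(120))


# Without sorting, the gaps are NaN, -90 and -65 seconds
unsorted_flags = flag(sub)

# With sorting, the gaps are NaN, 65 and 90 seconds (rows 10, 9, 7)
sorted_flags = flag(sub.sort_values('transactionDateTime')).sort_index()

print(pd.DataFrame({'unsorted': unsorted_flags, 'sorted': sorted_flags}))
```

Without the sort, rows 9 and 10 are flagged and row 7 is not. With the sort, rows 7 and 9 are flagged and the earliest transaction (row 10) stays unflagged, which is the result the question expects.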