P.J*_*P.J 3 python dataframe pandas
银行数据帧(DF)中有重复的交易。ID 是客户 ID。重复交易是多次刷卡,供应商不小心在短时间内(此处为 2 分钟)内多次向客户的卡收费。
DF = pd.DataFrame({'ID': ['111', '111', '111','111', '222', '222', '222', '333', '333', '333', '333','111'],'Dollar': [1,3,1,10, 25, 8, 25,9,20, 9, 9,10],'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50', '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50', '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53', '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])
ID Dollar transactionDateTime
0 111 1 2016-01-08 19:04:50
1 111 3 2016-01-29 19:03:55
2 111 1 2016-01-08 19:05:50
3 111 10 2016-01-08 20:08:50
4 222 25 2016-01-08 19:04:50
5 222 8 2016-02-08 19:04:50
6 222 25 2016-03-08 19:04:50
7 333 9 2016-01-08 19:04:50
8 333 20 2016-03-08 19:05:53
9 333 9 2016-01-08 19:03:20
10 333 9 2016-01-08 19:02:15
11 111 10 2016-02-08 20:08:50
Run Code Online (Sandbox Code Playgroud)
我想在我的数据框中添加一列,它可以识别重复的交易(同一客户 ID 的金额应该相同,交易日期时间应该少于 2 分钟)。请认为第一笔交易是“正常的”。
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 Yes
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 No
11 111 10 2016-02-08 20:08:50 No
Run Code Online (Sandbox Code Playgroud)
IIUC,你可以groupby和diff检查连续交易之间的差异是否小于120秒:
df['Duplicated?'] = (df.sort_values(['transactionDateTime'])
.groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
.diff()
.dt.total_seconds()
.lt(120))
df
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 False
1 111 3 2016-01-29 19:03:55 False
2 111 1 2016-01-08 19:05:50 True
3 111 100 2016-01-08 20:08:50 False
4 222 25 2016-01-08 19:04:50 False
5 222 8 2016-02-08 19:04:50 False
6 222 25 2016-03-08 19:04:50 False
7 333 9 2016-01-08 19:04:50 True
8 333 20 2016-03-08 19:05:53 False
9 333 9 2016-01-08 19:03:20 True
10 333 9 2016-01-08 19:02:15 False
11 111 100 2016-02-08 20:08:50 False
Run Code Online (Sandbox Code Playgroud)
请注意,您的数据未排序,因此您必须先对其进行排序以获得有意义的结果。
| 归档时间: |
|
| 查看次数: |
534 次 |
| 最近记录: |