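I'm having trouble using duplicated or drop_duplicates to find and remove all duplicates from a DataFrame.
My data looks like this, except that the real data is 52k rows long.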
import pandas as pd

data = {'inventory number':['WL-SMART-INWALL',
'WL-NMDISH-22',
'WL-MPS546-MESH',
'WAS-WG-500P',
'UKS/99757/69975',
'UKS/99757/69975',
'UKS/99750/S26361F2293L10',
'UKS/99750/S26361F2293L10',
'UKS/99733/69973',
'UKS/99733/69973',
'UKS/99727/AHD6502TU3CBK',
'UKS/99727/AHD6502TU3CBK',
'UKS/99725/PMK01',
'UKS/99725/PMK01',
'UKS/99294/A3L791R15MS',
'UKS/99294/A3L791R15MS',
'UKS/98865/58018251',
'UKS/98865/58018251',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/97771/FIBLCSC2',
'UKS/97771/FIBLCSC2',
'UKS/97627/FIBLCLC1',
'UKS/97627/FIBLCLC1'],
'minimum price': ['36.85',
'55.45',
'361.29',
'265.0',
'22.46',
'22.46',
'15.0',
'15.0',
'26.71',
'26.71',
'104.0',
'104.0',
'32.3',
'32.3',
'22.51',
'22.51',
'13.0',
'13.0',
'9.59',
'9.59',
'15.0',
'15.0',
'15.0',
'15.0'],
'cost':['26.11',
'39.23',
'254.99',
'187.09',
'16.0',
'16.0',
'10.7',
'10.7',
'19.0',
'19.0',
'73.46',
'73.46',
'23.0',
'23.0',
'16.0',
'16.0',
'9.29',
'9.29',
'7.0',
'7.0',
'10.7',
'10.7',
'10.7',
'10.7']
}
df = pd.DataFrame(data=data)
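I generate my DataFrame by appending last week's catalog to the bottom of this week's. I only want to act on the 'inventory number' values that have changed; in other words, I want the delta. I figured I could append the two, make sure they were the same data type, re-index, and drop duplicates, but when I write the result to CSV for QA, there are still thousands of duplicates.
Here is my code: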
_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
overlap = overlap.append(ts_data, ignore_index=True)
overlap_dedupe = overlap[overlap.duplicated(['inventory number','minimum price','cost'],keep=False)==False]
I have also tried:
overlap_dedupe = overlap.drop_duplicates(keep=False)
So I know I'm running into some kind of encoding issue, because now I get no duplicates at all.
combined.head(50).duplicated()
returns:
42736 False
32567 False
43033 False
33212 False
46592 False
46023 False
32568 False
33520 False
32756 False
26741 False
46133 False
42737 False
42480 False
40227 False
40562 False
49623 False
27712 False
31848 False
49794 False
27296 False
38198 False
35674 False
27907 False
22210 False
40563 False
18025 False
49624 False
18138 False
19357 False
43698 False
24398 False
50566 False
22276 False
38382 False
20507 False
43550 False
18150 False
29968 False
19247 False
47706 False
19248 False
43955 False
20731 False
38199 False
44168 False
17580 False
15944 False
44891 False
28327 False
16027 False
dtype: bool
These are sister functions that work well together.
Using your df
df = pd.read_json(
''.join(
['[[26.11,"WL-SMART-INWALL",36.85],[39.23,"WL-NMDISH-22",55.45',
'],[73.46,"UKS\\/99727\\/AHD6502TU3CBK",104.0],[73.46,"UKS\\/997',
'27\\/AHD6502TU3CBK",104.0],[23.0,"UKS\\/99725\\/PMK01",32.3],[2',
'3.0,"UKS\\/99725\\/PMK01",32.3],[16.0,"UKS\\/99294\\/A3L791R15MS',
'",22.51],[16.0,"UKS\\/99294\\/A3L791R15MS",22.51],[9.29,"UKS\\/',
'98865\\/58018251",13.0],[9.29,"UKS\\/98865\\/58018251",13.0],[7',
'.0,"UKS\\/98509\\/90Q653AN1N0N2UA0",9.59],[7.0,"UKS\\/98509\\/90',
'Q653AN1N0N2UA0",9.59],[254.99,"WL-MPS546-MESH",361.29],[10.7',
',"UKS\\/97771\\/FIBLCSC2",15.0],[10.7,"UKS\\/97771\\/FIBLCSC2",1',
'5.0],[10.7,"UKS\\/97627\\/FIBLCLC1",15.0],[10.7,"UKS\\/97627\\/F',
'IBLCLC1",15.0],[187.09,"WAS-WG-500P",265.0],[16.0,"UKS\\/9975',
'7\\/69975",22.46],[16.0,"UKS\\/99757\\/69975",22.46],[10.7,"UKS',
'\\/99750\\/S26361F2293L10",15.0],[10.7,"UKS\\/99750\\/S26361F229',
'3L10",15.0],[19.0,"UKS\\/99733\\/69973",26.71],[19.0,"UKS\\/997',
'33\\/69973",26.71]]']
)
)
We can clearly see that there are duplicates
df.duplicated()
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 True
10 False
11 True
12 False
13 False
14 True
15 False
16 True
17 False
18 False
19 True
20 False
21 True
22 False
23 True
dtype: bool
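Because we didn't pass the keep parameter, the default keep='first' applies. This means every True in this Series marks a row that duplicates another row above it whose duplicated status is False.
We can shortcut this and simply ask whether any duplicates exist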
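As a small illustration of how the keep parameter changes what duplicated flags, here is a minimal sketch. The three-row frame `demo` is hypothetical and only borrows values from the question's sample data.

```python
import pandas as pd

# Hypothetical frame: row 1 repeats row 0, row 2 is unique.
demo = pd.DataFrame({
    "inventory number": ["UKS/99757/69975", "UKS/99757/69975", "WL-NMDISH-22"],
    "cost": [16.0, 16.0, 39.23],
})

first = demo.duplicated()             # keep='first' (default): later copies are flagged
last = demo.duplicated(keep="last")   # earlier copies are flagged
every = demo.duplicated(keep=False)   # every copy is flagged

print(first.tolist())   # [False, True, False]
print(last.tolist())    # [True, False, False]
print(every.tolist())   # [True, True, False]
```

With keep=False, drop_duplicates removes every copy of a repeated row, which is why the question's code leaves only rows that appear exactly once.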
df.duplicated().any()
True
We can verify that drop_duplicates actually did something by chaining our handy duplicate test after a call to drop_duplicates
df.drop_duplicates().duplicated().any()
False
Great! It worked.
The result can be saved with
df = df.drop_duplicates()
df
0 1 2
0 26.11 WL-SMART-INWALL 36.85
1 39.23 WL-NMDISH-22 55.45
2 73.46 UKS/99727/AHD6502TU3CBK 104.00
4 23.00 UKS/99725/PMK01 32.30
6 16.00 UKS/99294/A3L791R15MS 22.51
8 9.29 UKS/98865/58018251 13.00
10 7.00 UKS/98509/90Q653AN1N0N2UA0 9.59
12 254.99 WL-MPS546-MESH 361.29
13 10.70 UKS/97771/FIBLCSC2 15.00
15 10.70 UKS/97627/FIBLCLC1 15.00
17 187.09 WAS-WG-500P 265.00
18 16.00 UKS/99757/69975 22.46
20 10.70 UKS/99750/S26361F2293L10 15.00
22 19.00 UKS/99733/69973 26.71
Just to make sure
df.duplicated().any()
False
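For the delta the question asks about, the same functions can be combined roughly as follows. This is only a sketch: `last_week`, `this_week`, `key_cols`, and their contents are hypothetical stand-ins for the two catalogs, and it assumes neither week contains duplicates within itself.

```python
import pandas as pd

# Hypothetical stand-ins for last week's and this week's catalogs.
last_week = pd.DataFrame({
    "inventory number": ["WL-NMDISH-22", "UKS/99757/69975", "UKS/99725/PMK01"],
    "minimum price": [55.45, 22.46, 32.30],
    "cost": [39.23, 16.00, 23.00],
})
this_week = pd.DataFrame({
    "inventory number": ["WL-NMDISH-22", "UKS/99757/69975",
                         "UKS/99725/PMK01", "WAS-WG-500P"],
    "minimum price": [55.45, 24.00, 32.30, 265.00],  # one price changed, one item is new
    "cost": [39.23, 16.00, 23.00, 187.09],
})

key_cols = ["inventory number", "minimum price", "cost"]

# Stack this week on top of last week, then drop every row that has an identical twin.
combined = pd.concat([this_week, last_week], ignore_index=True)
delta = combined.drop_duplicates(subset=key_cols, keep=False)

# Both the old and the new version of a changed row survive;
# rows from this_week occupy the first len(this_week) index positions.
changed = delta[delta.index < len(this_week)]
print(changed["inventory number"].tolist())
# ['UKS/99757/69975', 'WAS-WG-500P']
```

pd.concat is used here instead of DataFrame.append, which was removed in pandas 2.0.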
Conclusion
It worked just fine for me. Hopefully this demonstration helps you track down whatever is going wrong.
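One possible reason the same calls behave differently on the real data is that rows which look identical are not equal once dtype or stray whitespace is taken into account. The question's real data isn't shown, so this is an assumption. The frames `a` and `b` and the helper `normalize` below are hypothetical.

```python
import pandas as pd

# Hypothetical rows that print alike but differ in dtype and trailing whitespace.
a = pd.DataFrame({"inventory number": ["UKS/99725/PMK01"], "cost": ["23.0"]})
b = pd.DataFrame({"inventory number": ["UKS/99725/PMK01 "], "cost": [23.0]})

raw = pd.concat([a, b], ignore_index=True)
print(raw.duplicated().any())    # False: '23.0' != 23.0 and the keys differ by a space


def normalize(frame):
    """Return a copy with stripped string keys and numeric costs."""
    out = frame.copy()
    out["inventory number"] = out["inventory number"].astype(str).str.strip()
    out["cost"] = pd.to_numeric(out["cost"])
    return out


clean = pd.concat([normalize(a), normalize(b)], ignore_index=True)
print(clean.duplicated().any())  # True: the rows now compare equal
```

Checking `df.dtypes` on both inputs before combining them is a quick way to spot this kind of mismatch.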