Python pandas df.duplicated and df.drop_duplicates not finding all duplicates

Yal*_*man 6 python pandas

I'm having trouble using `duplicated` or `drop_duplicates` to find and remove all of the duplicates in my dataframe.

My data looks like this; however, my real data is 52k rows long.

import pandas as pd

data = {'inventory number':['WL-SMART-INWALL',
                         'WL-NMDISH-22',
                         'WL-MPS546-MESH',
                         'WAS-WG-500P',
                         'UKS/99757/69975',
                         'UKS/99757/69975',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99733/69973',
                         'UKS/99733/69973',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99725/PMK01',
                         'UKS/99725/PMK01',
                         'UKS/99294/A3L791R15MS',
                         'UKS/99294/A3L791R15MS',
                         'UKS/98865/58018251',
                         'UKS/98865/58018251',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97627/FIBLCLC1',
                         'UKS/97627/FIBLCLC1'],
        'minimum price': ['36.85',
                         '55.45',
                         '361.29',
                         '265.0',
                         '22.46',
                         '22.46',
                         '15.0',
                         '15.0',
                         '26.71',
                         '26.71',
                         '104.0',
                         '104.0',
                         '32.3',
                         '32.3',
                         '22.51',
                         '22.51',
                         '13.0',
                         '13.0',
                         '9.59',
                         '9.59',
                         '15.0',
                         '15.0',
                         '15.0',
                         '15.0'],
    'cost':['26.11',
                         '39.23',
                         '254.99',
                         '187.09',
                         '16.0',
                         '16.0',
                         '10.7',
                         '10.7',
                         '19.0',
                         '19.0',
                         '73.46',
                         '73.46',
                         '23.0',
                         '23.0',
                         '16.0',
                         '16.0',
                         '9.29',
                         '9.29',
                         '7.0',
                         '7.0',
                         '10.7',
                         '10.7',
                         '10.7',
                         '10.7']
   }
df = pd.DataFrame(data=data)

I generate my dataframe by appending last week's catalog to the bottom of this week's. I only want to act on the "inventory numbers" that have changed; in other words, I want the delta. I figured I could append the two, make sure they're the same data type, re-index, and drop the duplicates, but when I write the CSV out for QA there are still thousands of duplicates.
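(For comparison, an outer merge with `indicator=True` is another way to isolate just the rows that changed between two catalogs. This is a minimal sketch; the frames `last_week` and `this_week` are made-up stand-ins, not the actual data:)

```python
import pandas as pd

# Hypothetical last-week and this-week catalogs; only item 'B' changed price.
last_week = pd.DataFrame({'inventory number': ['A', 'B', 'C'],
                          'minimum price': ['1.0', '2.0', '3.0']})
this_week = pd.DataFrame({'inventory number': ['A', 'B', 'C'],
                          'minimum price': ['1.0', '2.5', '3.0']})

# An outer merge on all columns labels each row by its origin in '_merge';
# anything not marked 'both' differs between the two catalogs.
delta = last_week.merge(this_week, how='outer', indicator=True)
changed = delta[delta['_merge'] != 'both']
```

Here `changed` holds both versions of item 'B' (the old row tagged `left_only`, the new one `right_only`), which is exactly the delta.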

Here is my code:

_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
overlap = overlap.append(ts_data, ignore_index=True)
overlap_dedupe = overlap[overlap.duplicated(['inventory  number','minimum price','cost'],keep=False)==False]

I also tried:

overlap_dedupe = overlap.drop_duplicates(keep=False)

So I know I'm running into some kind of encoding issue, because now I'm not getting any duplicates at all.

combined.head(50).duplicated()

Returns:

42736    False
32567    False
43033    False
33212    False
46592    False
46023    False
32568    False
33520    False
32756    False
26741    False
46133    False
42737    False
42480    False
40227    False
40562    False
49623    False
27712    False
31848    False
49794    False
27296    False
38198    False
35674    False
27907    False
22210    False
40563    False
18025    False
49624    False
18138    False
19357    False
43698    False
24398    False
50566    False
22276    False
38382    False
20507    False
43550    False
18150    False
29968    False
19247    False
47706    False
19248    False
43955    False
20731    False
38199    False
44168    False
17580    False
15944    False
44891    False
28327    False
16027    False
dtype: bool
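For reference, the most common reason visually identical rows fail the duplicate check is an invisible difference: stray whitespace, mixed `bytes`/`str` values after encoding, or numeric strings that differ in formatting. A minimal sketch of normalizing before deduplicating (the trailing-space row is a made-up illustration, not taken from the data above):

```python
import pandas as pd

# Two rows that look identical but differ by a trailing space.
df = pd.DataFrame({'inventory number': ['UKS/99757/69975', 'UKS/99757/69975 '],
                   'minimum price': ['22.46', '22.46']})

# The trailing space makes the rows unequal, so nothing is flagged.
assert not df.duplicated().any()

# Strip whitespace and coerce price strings to floats before comparing.
df['inventory number'] = df['inventory number'].str.strip()
df['minimum price'] = pd.to_numeric(df['minimum price'])
assert df.duplicated().any()
```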

piR*_*red 2

drop_duplicates *(documentation screenshot)*

duplicated *(documentation screenshot)*


These are sister functions that work very well together.

Using your df:

df = pd.read_json(
    ''.join(
        ['[[26.11,"WL-SMART-INWALL",36.85],[39.23,"WL-NMDISH-22",55.45',
         '],[73.46,"UKS\\/99727\\/AHD6502TU3CBK",104.0],[73.46,"UKS\\/997',
         '27\\/AHD6502TU3CBK",104.0],[23.0,"UKS\\/99725\\/PMK01",32.3],[2',
         '3.0,"UKS\\/99725\\/PMK01",32.3],[16.0,"UKS\\/99294\\/A3L791R15MS',
         '",22.51],[16.0,"UKS\\/99294\\/A3L791R15MS",22.51],[9.29,"UKS\\/',
         '98865\\/58018251",13.0],[9.29,"UKS\\/98865\\/58018251",13.0],[7',
         '.0,"UKS\\/98509\\/90Q653AN1N0N2UA0",9.59],[7.0,"UKS\\/98509\\/90',
         'Q653AN1N0N2UA0",9.59],[254.99,"WL-MPS546-MESH",361.29],[10.7',
         ',"UKS\\/97771\\/FIBLCSC2",15.0],[10.7,"UKS\\/97771\\/FIBLCSC2",1',
         '5.0],[10.7,"UKS\\/97627\\/FIBLCLC1",15.0],[10.7,"UKS\\/97627\\/F',
         'IBLCLC1",15.0],[187.09,"WAS-WG-500P",265.0],[16.0,"UKS\\/9975',
         '7\\/69975",22.46],[16.0,"UKS\\/99757\\/69975",22.46],[10.7,"UKS',
         '\\/99750\\/S26361F2293L10",15.0],[10.7,"UKS\\/99750\\/S26361F229',
         '3L10",15.0],[19.0,"UKS\\/99733\\/69973",26.71],[19.0,"UKS\\/997',
         '33\\/69973",26.71]]']
    )
)
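(A side note: in recent pandas releases, 2.1 and later, passing a literal JSON string to `read_json` emits a deprecation warning; wrapping the string in `StringIO` keeps the pattern working. A tiny equivalent:)

```python
from io import StringIO
import pandas as pd

# Wrap literal JSON in StringIO instead of passing the raw string.
df = pd.read_json(StringIO('[[1, "A"], [1, "A"], [2, "B"]]'))
print(df.duplicated().tolist())  # [False, True, False]
```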

We can clearly see that there are duplicates:

df.duplicated()

0     False
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9      True
10    False
11     True
12    False
13    False
14     True
15    False
16     True
17    False
18    False
19     True
20    False
21     True
22    False
23     True
dtype: bool

Because we didn't pass a `keep` argument, the default `keep='first'` applies. That means each `True` in the `duplicated` Series marks a row that is a duplicate of another row above it, while the first occurrence of each row stays `False`.
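For contrast, here is how the three `keep` options behave on a toy frame (not the question's data):

```python
import pandas as pd

s = pd.DataFrame({'x': [1, 1, 2]})

# keep='first': the first occurrence is not flagged.
print(s.duplicated(keep='first').tolist())  # [False, True, False]
# keep='last': the last occurrence is not flagged.
print(s.duplicated(keep='last').tolist())   # [True, False, False]
# keep=False: every member of a duplicate group is flagged.
print(s.duplicated(keep=False).tolist())    # [True, True, False]
```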

We can shorten this to get a direct answer to whether any duplicates exist:

df.duplicated().any()

True

We can verify that `drop_duplicates` actually did something by chaining our handy duplicate check after the call to `drop_duplicates`:

df.drop_duplicates().duplicated().any()

False

Great! It works.
The result can be kept with:

df = df.drop_duplicates()
df

         0                           1       2
0    26.11             WL-SMART-INWALL   36.85
1    39.23                WL-NMDISH-22   55.45
2    73.46     UKS/99727/AHD6502TU3CBK  104.00
4    23.00             UKS/99725/PMK01   32.30
6    16.00       UKS/99294/A3L791R15MS   22.51
8     9.29          UKS/98865/58018251   13.00
10    7.00  UKS/98509/90Q653AN1N0N2UA0    9.59
12  254.99              WL-MPS546-MESH  361.29
13   10.70          UKS/97771/FIBLCSC2   15.00
15   10.70          UKS/97627/FIBLCLC1   15.00
17  187.09                 WAS-WG-500P  265.00
18   16.00             UKS/99757/69975   22.46
20   10.70    UKS/99750/S26361F2293L10   15.00
22   19.00             UKS/99733/69973   26.71

Just to confirm:

df.duplicated().any()

False

Conclusion

It works fine for me. Hopefully this demonstration helps you track down what's going wrong on your end.