How to drop duplicates but keep the row if a specific other column is not empty (pandas)

SCo*_*ool 5 python duplicates pandas

I have a lot of duplicate records, some of which have a bank account. I want to keep the record that has the bank account.

Basically something like this:

if there are two Tommy Joes:
     keep the one with a bank account

I tried to deduplicate with the code below, but it keeps the duplicates that have no bank account.

import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname':['foo Bar','Bar Bar','Foo Bar','jim','john','mary','jim'],
                   'lastname':['Foo Bar','Bar','Foo Bar','ryan','con','sullivan','Ryan'],
                   'email':['Foo bar','Bar','Foo Bar','jim@com','john@com','mary@com','Jim@com'],
                   'bank':[np.nan,'abc','xyz',np.nan,'tge','vbc','dfg']})


df


  firstname  lastname     email bank
0   foo Bar   Foo Bar   Foo bar  NaN  
1   Bar Bar       Bar       Bar  abc
2   Foo Bar   Foo Bar   Foo Bar  xyz
3       jim      ryan   jim@com  NaN
4      john       con  john@com  tge
5      mary  sullivan  mary@com  vbc
6       jim      Ryan   Jim@com  dfg



# get the index of unique values, based on firstname, lastname, email
# convert to lower and remove white space first

uniq_indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index


# save unique records
dfiban_uniq = df.loc[uniq_indx]

dfiban_uniq



  firstname  lastname     email bank
0   foo Bar   Foo Bar   Foo bar  NaN # should not be here
1   Bar Bar       Bar       Bar  abc
3       jim      ryan   jim@com  NaN # should not be here
4      john       con  john@com  tge
5      mary  sullivan  mary@com  vbc


# I wanted these duplicates to appear in the result:

  firstname  lastname     email bank
2   Foo Bar   Foo Bar   Foo Bar  xyz  
6       jim      Ryan   Jim@com  dfg


You can see that indexes 0 and 3 were kept, while the versions of those customers that do have a bank account were dropped. My expected result is the opposite: drop the duplicates that have no bank account.

I have thought about sorting by bank account first, but I have so much data that I am not sure how to "check" that it actually works.
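If it helps, one way to "check" a sort-first approach without eyeballing the whole dataset is to assert that, for every duplicate group that contains at least one bank account, the surviving row also has one. A minimal sketch using the question's columns (the `_key` helper column and the normalization are assumptions mirroring the dedup logic above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['foo Bar', 'Foo Bar', 'jim', 'jim'],
                   'lastname':  ['Foo Bar', 'Foo Bar', 'ryan', 'Ryan'],
                   'email':     ['Foo bar', 'Foo Bar', 'jim@com', 'Jim@com'],
                   'bank':      [np.nan, 'xyz', np.nan, 'dfg']})

# normalized dedup key: lowercase, whitespace removed
key = (df[['firstname', 'lastname', 'email']]
       .apply(lambda c: c.str.lower().str.replace(' ', '', regex=False))
       .agg('|'.join, axis=1))

# sort rows with a bank account first, then keep the first of each key
deduped = (df.assign(_key=key)
             .sort_values('bank', na_position='last')
             .drop_duplicates('_key', keep='first'))

# check: every group that had at least one bank account
# must end up with a survivor that has a bank account
has_bank_by_group = df['bank'].notna().groupby(key).any()
survivor_has_bank = deduped.set_index('_key')['bank'].notna()
assert survivor_has_bank[has_bank_by_group[has_bank_by_group].index].all()
```

The same assertion can be run on a random sample of the real data before trusting the full run.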

Any help is appreciated.

There are some similar questions here, but they all seem to involve values that can be meaningfully sorted, such as age etc. These hashed bank account numbers are very messy.

EDIT:

Trying the answers on my real dataset produced the following results.

@Erfan's method of sorting values by subset + bank.

58594 records remaining after dedupe:

subset = ['firstname', 'lastname']

df[subset] = df[subset].apply(lambda x: x.str.lower())
# note: Series.replace(" ", "") only swaps cells that are exactly a single
# space; .str.replace is needed to strip spaces inside the strings
df[subset] = df[subset].apply(lambda x: x.str.replace(" ", "", regex=False))
df.sort_values(subset + ['bank'], inplace=True)
df.drop_duplicates(subset, inplace=True)

print(df.shape[0])

58594 

@Adam.Er8's answer using values sorted by bank. 59170 records remaining after dedupe:

uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

df.loc[uniq_indx].shape[0]

59170

Not sure where the difference comes from, but both are close enough.
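One likely source of the gap (an observation on the two snippets above, not something verified against the real data): the first snippet dedupes on `['firstname', 'lastname']` only, while the second also includes `email` in the key, so the two-column key is expected to collapse more rows. A tiny illustration:

```python
import pandas as pd

# two people who share a name but have different emails
df = pd.DataFrame({'firstname': ['jim', 'jim'],
                   'lastname':  ['ryan', 'ryan'],
                   'email':     ['jim@a.com', 'jim@b.com']})

# a two-column key treats them as duplicates...
assert df.drop_duplicates(['firstname', 'lastname']).shape[0] == 1

# ...while a three-column key keeps both
assert df.drop_duplicates(['firstname', 'lastname', 'email']).shape[0] == 2
```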

Ada*_*Er8 8

You should sort the values by the bank column with na_position='last' (so that .drop_duplicates(..., keep='first') keeps a value that is not NaN).

Try this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname': ['foo Bar', 'Bar Bar', 'Foo Bar'],
                   'lastname': ['Foo Bar', 'Bar', 'Foo Bar'],
                   'email': ['Foo bar', 'Bar', 'Foo Bar'],
                   'bank': [np.nan, 'abc', 'xyz']})

uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

# save unique records
dfiban_uniq = df.loc[uniq_indx]

print(dfiban_uniq)


Output:

  bank    email firstname lastname
1  abc      Bar   Bar Bar      Bar
2  xyz  Foo Bar   Foo Bar  Foo Bar

(This is just your original uniq_indx = ... code with .sort_values(by="bank", na_position='last') added at the start of the chain.)
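Since the hashed account numbers have no meaningful order, a variant (a sketch, not verified against the real data) is to sort only on *whether* bank is present, via the `key=` argument of `sort_values` (available in pandas >= 1.1), rather than on the hash values themselves. `applymap` is kept for consistency with the answer above, although newer pandas versions rename it to `DataFrame.map`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['jim', 'jim'],
                   'lastname':  ['ryan', 'Ryan'],
                   'email':     ['jim@com', 'Jim@com'],
                   'bank':      [np.nan, 'dfg']})

# rows WITH a bank account sort first (notna() is True for them,
# and ascending=False puts True before False)
uniq_indx = (df.sort_values(by='bank', key=lambda s: s.notna(), ascending=False)
               .dropna(subset=['firstname', 'lastname', 'email'])
               .applymap(lambda s: s.lower() if isinstance(s, str) else s)
               .applymap(lambda x: x.replace(' ', '') if isinstance(x, str) else x)
               .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

print(df.loc[uniq_indx])
```

This keeps only row 1 (the jim with bank 'dfg') and sidesteps any dependence on how the hashes happen to compare lexically.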