在python中合并数据帧时重复的行

Rob*_*tti 10 python merge python-2.7 python-3.x pandas

我目前正在使用外连接合并2个数据帧,但在合并之后,我看到所有行都是重复的,即使我合并的列包含相同的值.详细地:

list_1 = pd.read_csv('list_1.csv')
list_2 = pd.read_csv('list_2.csv')

merged_list = pd.merge(list_1 , list_2 , on=['email_address'], how='inner')
Run Code Online (Sandbox Code Playgroud)

具有以下输入和结果:

LIST_1:

email_address, name, surname
john.smith@email.com, john, smith
john.smith@email.com, john, smith
elvis@email.com, elvis, presley
Run Code Online (Sandbox Code Playgroud)

list_2:

email_address, street, city
john.smith@email.com, street1, NY
john.smith@email.com, street1, NY
elvis@email.com, street2, LA
Run Code Online (Sandbox Code Playgroud)

merged_list:

email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA
elvis@email.com, elvis, presley, street2, LA
Run Code Online (Sandbox Code Playgroud)

我的问题是,不应该这样吗?

merged_list(我希望如何:D):

email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA
Run Code Online (Sandbox Code Playgroud)

我怎样才能让它变成这样?非常感谢你的帮助!

piR*_*red 20

list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

期望重复的行.每个约翰史密斯都list_1与每个约翰史密斯相匹配list_2.我不得不将重复项删除到其中一个列表中.我选择了list_2.

  • @RobertoBertinetti,请考虑[接受](http://meta.stackexchange.com/a/5235)并提出答案,如果您认为它已经回答了您的问题 (2认同)

Raf*_*ral 11

不要在合并之前删除重复项,而是在合并之后!

最好的解决方案是进行合并,然后删除重复项。

在你的情况下:

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
Run Code Online (Sandbox Code Playgroud)

  • 为什么合并后需要删除重复项?为什么合并会产生重复项?你能帮我理解一下吗?非常感谢。 (2认同)