Compare pandas DataFrames and return rows missing from the first one

run*_*i74 4 python dataframe pandas

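I have two DataFrames and want to compare them and return the rows of the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but I can't figure out how to return only the rows from df1 that are missing.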

import pandas as pd
from pandas import Series, DataFrame

df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )

df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )



df = pd.concat([df1, df2])
df = df.reset_index(drop=True)

df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)

jab*_*lcu 7

Building on @EdChum's suggestion:

df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]

rows_in_df1_not_in_df2

|Index |City        |State       |
|------|------------|------------|
|1     |San Franciso|California  |
|2     |Boston      |Massachusett|
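The filter above can be wrapped in a small reusable function. The following is a minimal sketch, not part of the original answer: the helper name `rows_only_in_left` is hypothetical, and it assumes both frames share the columns of the left frame and that matching on all of those columns is what you want. It uses a left merge so that no `right_only` rows are produced in the first place.

```python
import pandas as pd


def rows_only_in_left(left, right):
    """Return rows of `left` with no exact match in `right` (hypothetical helper)."""
    cols = list(left.columns)
    # Deduplicate `right` on the compared columns so that a repeated match
    # cannot multiply rows of `left` in the merge result.
    right_unique = right[cols].drop_duplicates()
    merged = left.merge(right_unique, how="left", on=cols, indicator=True)
    # Note: merge builds a fresh 0..n-1 index, so custom index labels
    # on `left` are not preserved.
    return merged.loc[merged["_merge"] == "left_only", cols]


df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"],
})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"],
})

missing = rows_only_in_left(df1, df2)
print(missing)
```

Because the merge is a left join against a deduplicated right frame, each row of `left` appears exactly once in the intermediate result, including any rows that are duplicated within `left` itself.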


EdC*_*ica 5

IIUC, if you're using pandas version 0.17.0 or later, you can use merge and set indicator=True:

In [80]:
df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)

Out[80]:
           City         State      _merge
0       Chicago      Illinois        both
1  San Franciso    California   left_only
2        Boston  Massachusett   left_only
3      Mmmmiami       Florida  right_only
4        Dallas         Texas  right_only
5         Omaha      Nebraska  right_only

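This adds a column indicating whether each row appears only in the left-hand DataFrame (lhs), only in the right-hand one (rhs), or in both.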
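As a quick way to inspect that column, the sketch below counts how many rows fall into each category. It is an illustration rather than part of the original answer: it passes a string to `indicator` to rename the column, and the name `source` is an arbitrary choice made here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"],
})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"],
})

# Passing a string instead of True names the indicator column
# ("source" is an arbitrary name chosen for this example).
merged = pd.merge(df1, df2, how="outer", indicator="source")

# The indicator column is categorical; cast to str for plain label counts.
counts = merged["source"].astype(str).value_counts()
print(counts)
```

With the sample data, this should report one row present in both frames, two found only in df1, and three found only in df2, which matches the output shown above.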