Merging DataFrames in Python without duplicating columns

Epi*_*est 5 python python-3.x pandas

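I am trying to merge several DataFrames on a common column. This will be done in a loop, and the original DataFrame may not have all of the columns, so an outer merge is required. However, when I do this across several different DataFrames, columns get duplicated with the suffixes _x and _y. I am looking for a single DataFrame in which the data is filled in and columns are added only if they did not previously exist.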

import pandas as pd

df1 = pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
  Company Name  Data1  Data2
0            A      1     13
1            B     34     54
2            C     23   5354
3            D     66    443

A second DataFrame contains additional information for some of the companies:

df2 = pd.DataFrame({'Company Name':['A','B'],'Address':['str1','str2'],'Phone':['str1a','str2a']})

  Company Name Address  Phone
0            A    str1  str1a
1            B    str2  str2a

If I want to combine these two, merging with on='Company Name' successfully produces a single DataFrame:

df1=pd.merge(df1,df2, on='Company Name', how='outer')

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354     NaN    NaN
3            D     66    443     NaN    NaN

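However, if I run the same command again in a loop, or if I merge another DataFrame that holds information for other companies, I end up with duplicated columns like the following: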

df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
  Company Name  Data1  Data2 Address_x Phone_x Address_y Phone_y
0            A      1     13      str1   str1a       NaN     NaN
1            B     34     54      str2   str2a       NaN     NaN
2            C     23   5354       NaN     NaN      str3   str3a
3            D     66    443       NaN     NaN       NaN     NaN

What I really want is a DataFrame with the same columns, with any missing data simply filled in:

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354    str3  str3a
3            D     66    443     NaN    NaN

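Thanks in advance. I have reviewed the earlier questions about duplicate columns and gone through the pandas documentation, without making any progress.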

Ben*_*n.T 1

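Since you are looking to merge one DataFrame at a time in a loop, here is one way to do it. It works whether or not the new DataFrame contains new company names or new columns: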

df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
                    'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
                          'Address':  ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
            pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]

for df_other in list_dfo:
    df1 = pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index()
    # and other code

At the end of this example:

print(df1)
  Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN
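
To see why this works, it can help to look at the intermediate result. The sketch below reuses the data from the question; the names df2, df3, step1, merged and collapsed are mine, chosen only for illustration. When on is omitted, pd.merge joins on every column the two frames share, so no _x/_y suffixes are created. The new values land in extra rows instead, and groupby(...).first() then collapses those rows to one per company by taking the first non-null value in each column.

```python
import pandas as pd

df1 = pd.DataFrame({'Company Name': ['A', 'B', 'C', 'D'],
                    'Data1': [1, 34, 23, 66],
                    'Data2': [13, 54, 5354, 443]})
df2 = pd.DataFrame({'Company Name': ['A', 'B'],
                    'Address': ['str1', 'str2'],
                    'Phone': ['str1a', 'str2a']})
df3 = pd.DataFrame({'Company Name': ['C'],
                    'Address': ['str3'],
                    'Phone': ['str3a']})

# First merge: only 'Company Name' is shared, so this behaves like on='Company Name'.
step1 = pd.merge(df1, df2, how='outer')

# Second merge: 'Company Name', 'Address' and 'Phone' are all shared, so all
# three become join keys. Company C does not match on Address/Phone (NaN vs
# 'str3'), so it shows up in two separate rows rather than in suffixed columns.
merged = pd.merge(step1, df3, how='outer')
print(merged)

# Collapse to one row per company, keeping the first non-null value per column.
collapsed = merged.groupby('Company Name').first().reset_index()
print(collapsed)
```

Note that the numeric columns become float after the second merge, because the extra row for company C temporarily holds NaN in Data1 and Data2.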

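You can use last instead of first, which keeps the last valid value rather than the first valid value in each column of each group. Which one you need depends on whether you want to keep the data from df1 or the data from df_other when both are available. In the example above it changes nothing, but you will see the difference in the following case: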

#company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})

#first() keeps the value from df1
print(pd.merge(df1,df4,how='outer').groupby('Company Name')
        .first().reset_index())
Out[21]: 
  Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a   #address is str1 from df1
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN

#while last() keeps the value from df4
print (pd.merge(df1,df4,how='outer').groupby('Company Name')
         .last().reset_index())
Out[22]: 
  Company Name  Data1   Data2   Address  Phone
0            A    1.0    13.0  new_str1  str1a   #address is new_str1 from df4
1            B   34.0    54.0      str2  str2a
2            C   23.0  5354.0      str3  str3a
3            D   66.0   443.0       NaN    NaN
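
Because first() and last() pick values by row position, the outcome depends on the order in which the outer merge returns its rows. As a possible alternative that does not rely on row order, here is a minimal sketch built on DataFrame.combine_first, which keeps the caller's values and fills only its missing cells (and missing rows and columns) from the other frame. The helper fill_from and its prefer_other flag are hypothetical names of mine, not part of pandas, and the sketch assumes 'Company Name' is unique within each DataFrame.

```python
import pandas as pd

def fill_from(base, other, key='Company Name', prefer_other=False):
    """Hypothetical helper: combine two frames on `key` without suffixes.

    prefer_other=False keeps values from `base` and fills gaps from `other`.
    prefer_other=True lets non-null values from `other` take precedence.
    Assumes `key` is unique within each frame.
    """
    left = base.set_index(key)
    right = other.set_index(key)
    if prefer_other:
        combined = right.combine_first(left)
    else:
        combined = left.combine_first(right)
    return combined.rename_axis(key).reset_index()

df1 = pd.DataFrame({'Company Name': ['A', 'B', 'C', 'D'],
                    'Data1': [1, 34, 23, 66],
                    'Data2': [13, 54, 5354, 443]})
list_dfo = [pd.DataFrame({'Company Name': ['A', 'B'],
                          'Address': ['str1', 'str2'],
                          'Phone': ['str1a', 'str2a']}),
            pd.DataFrame({'Company Name': ['C'],
                          'Address': ['str3'],
                          'Phone': ['str3a']})]

result = df1
for df_other in list_dfo:
    result = fill_from(result, df_other)

# Company A has a new address
df4 = pd.DataFrame({'Company Name': ['A'], 'Address': ['new_str1']})
kept = fill_from(result, df4)                          # keeps the existing address
replaced = fill_from(result, df4, prefer_other=True)   # takes the address from df4
print(kept)
print(replaced)
```

The column order of the result may differ from the merge-based version, so reorder the columns afterwards if the layout matters.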