In pandas, how to concatenate horizontally and then remove the redundant columns

Jun*_*ang 6 python pandas

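Say I have two DataFrames.

DF1: col1, col2, col3

DF2: col2, col4, col5

How do I concatenate the two DataFrames horizontally so that the result has col1, col2, col3, col4 and col5? Right now I am doing pd.concat([DF1, DF2], axis=1), but it ends up with two col2 columns. Assuming all the values in the two col2 columns are the same, I want to keep only one of them.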

All*_*len 6

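Dropping duplicates should work. Because drop_duplicates compares rows, not columns, we need to transpose the DataFrame, drop the duplicates, and then transpose it back.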

pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T
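
As a minimal sketch of what the line above does, here it is on a small made-up example (the sample values are assumptions for illustration, not from the question). Note that the transpose approach compares column values, so it would also drop a differently named column that happens to hold identical data, and transposing a frame with mixed dtypes converts everything to object. A label-based alternative using `columns.duplicated()` is shown next to it for comparison.

```python
import pandas as pd

# Hypothetical sample data; col2 holds the same values in both frames.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [3, 4], "col4": [7, 8], "col5": [9, 10]})

combined = pd.concat([DF1, DF2], axis=1)  # contains col2 twice

# Approach from the answer: transpose so columns become rows,
# drop rows with identical values, then transpose back.
by_value = combined.T.drop_duplicates().T

# Alternative: keep the first occurrence of each column label,
# without looking at the values at all.
by_label = combined.loc[:, ~combined.columns.duplicated()]

print(list(by_value.columns))
print(list(by_label.columns))
```

Which variant fits depends on whether "duplicate" means the same label or the same contents.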


jez*_*ael 5

Use difference to get the columns of DF2 that are not in DF1, then simply select them with []:

import pandas as pd

DF1 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(columns=['col2', 'col4', 'col5'])


DF2 = DF2[DF2.columns.difference(DF1.columns)]
print (DF2)
Empty DataFrame
Columns: [col4, col5]
Index: []

print (pd.concat([DF1, DF2], axis = 1))
Empty DataFrame
Columns: [col1, col2, col3, col4, col5]
Index: []
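
One detail worth knowing: with its default settings, `Index.difference` returns the labels in sorted order, which may not match the original column order of DF2. The sketch below, using made-up column names (`zeta`, `alpha`) and values, contrasts it with a boolean mask built from `Index.isin`, which keeps the original order.

```python
import pandas as pd

# Hypothetical frames; the names "zeta" and "alpha" are chosen so that
# sorted order differs from the original column order.
DF1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})
DF2 = pd.DataFrame({"col2": [3, 4], "zeta": [7, 8], "alpha": [9, 10]})

# difference() sorts the resulting labels by default.
sorted_cols = DF2.columns.difference(DF1.columns)

# A boolean mask keeps the labels in their original order.
new_cols = DF2.columns[~DF2.columns.isin(DF1.columns)]

result = pd.concat([DF1, DF2[new_cols]], axis=1)

print(list(sorted_cols))
print(list(new_cols))
print(list(result.columns))
```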

Timings

import numpy as np
import pandas as pd

np.random.seed(123)

N = 1000
DF1 = pd.DataFrame(np.random.rand(N,3), columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(np.random.rand(N,3), columns=['col2', 'col4', 'col5'])

DF2['col2'] = DF1['col2']

In [408]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
10 loops, best of 3: 122 ms per loop

In [409]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 979 µs per loop
N = 10000:
In [411]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
1 loop, best of 3: 1.4 s per loop

In [412]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 1.12 ms per loop
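
The `%timeit` lines above only run inside IPython. A rough way to repeat the comparison in a plain script is the standard `timeit` module, sketched below; the function names `via_transpose` and `via_difference` are made up for this example, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 1000
DF1 = pd.DataFrame(rng.random((N, 3)), columns=["col1", "col2", "col3"])
DF2 = pd.DataFrame(rng.random((N, 3)), columns=["col2", "col4", "col5"])
DF2["col2"] = DF1["col2"]  # make the shared column identical


def via_transpose():
    # Transpose, drop duplicate rows, transpose back.
    return pd.concat([DF1, DF2], axis=1).T.drop_duplicates().T


def via_difference():
    # Select only the DF2 columns that DF1 lacks, then concatenate.
    return pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis=1)


for fn in (via_transpose, via_difference):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```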