Aye*_*van 27 python merge anti-join dataframe pandas
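I have two tables and want to append them so that all of the data in table A is kept, and data from table B is added only if its key is not already present (key values are unique within table A and within table B, but in some cases a key appears in both tables).
I think the way to do this involves some kind of filtering join (anti-join) to get the values in table B that do not occur in table A, and then appending the two tables.
I am familiar with R, and this is the code I would use to do this in R.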
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in Python?
piR*_*red 21
Consider the following dataframes
import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want (method 1)
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)
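For a version that can be run on its own, here is a minimal sketch of the same set-difference idea with fixed values in place of random ones. The column values are made up for illustration, and `pd.concat` is used for the append step because `DataFrame.append` is not available in pandas 2.0 and later.

```python
import pandas as pd

# Illustrative data (assumed values); keys 'a' and 'c' appear in both tables
TableA = pd.DataFrame({'Key': list('abcd'), 'A': [1.0, 2.0, 3.0, 4.0]})
TableB = pd.DataFrame({'Key': list('aecf'), 'A': [10.0, 20.0, 30.0, 40.0]})

# Anti-join: keep only the TableB rows whose Key is absent from TableA
only_in_B = TableB[~TableB['Key'].isin(TableA['Key'])]

# Append the filtered rows to TableA
combined = pd.concat([TableA, only_in_B], ignore_index=True)
print(combined)
```

Every row of TableA survives unchanged, and only the TableB rows with keys 'e' and 'f' are added.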
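A second way, looping over the rows of TableB (method 2)
rows = []
for i, row in TableB.iterrows():
    # Keep the row only if its key does not occur in TableA
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T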
4 rows, 2 overlapping
Method 1 is much faster
10,000 rows, 5,000 overlapping
Loops are bad
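The timing output itself is not reproduced here. As a rough sketch of how such a comparison could be rerun, the snippet below times a vectorized `isin` filter against a row-by-row loop using `timeit`. The helper names (`make_tables`, `method_isin`, `method_loop`), the table sizes, and the loop variant that builds a DataFrame from the collected rows are all assumptions for illustration, and the numbers will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd


def make_tables(n, overlap):
    """Build two n-row tables whose keys share `overlap` values (hypothetical helper)."""
    rng = np.random.default_rng(0)
    a = pd.DataFrame({'Key': [f'k{i}' for i in range(n)], 'A': rng.random(n)})
    b = pd.DataFrame({'Key': [f'k{i}' for i in range(n - overlap, 2 * n - overlap)],
                      'A': rng.random(n)})
    return a, b


def method_isin(a, b):
    # Vectorized anti-join followed by an append
    return pd.concat([a, b[~b['Key'].isin(a['Key'])]], ignore_index=True)


def method_loop(a, b):
    # Row-by-row membership test, then an append of the collected rows
    rows = [row for _, row in b.iterrows() if row['Key'] not in a['Key'].values]
    return pd.concat([a, pd.DataFrame(rows)], ignore_index=True)


a, b = make_tables(1_000, 500)
t_isin = timeit.timeit(lambda: method_isin(a, b), number=3)
t_loop = timeit.timeit(lambda: method_loop(a, b), number=3)
print(f'isin: {t_isin:.4f}s  loop: {t_loop:.4f}s')
```

Both functions should return the same keys in the same order, so the comparison is between two ways of computing one result.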
小智 8
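The indicator = True option of the merge command tells you which join was applied by creating a new column, _merge, with three possible values:
left_only
right_only
both
You need to keep right_only and append it back to the first table. That's it.
Also, don't forget to drop the _merge column after use.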
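# Join on Key only; without on='Key' the merge would match on every shared column (Key, A, B, C)
# Using just TableA's Key column keeps TableB's column names free of suffixes
outer_join = TableA[['Key']].merge(TableB, how = 'outer', on = 'Key', indicator = True)
anti_join_B_only = outer_join[outer_join._merge == 'right_only']
anti_join_B_only = anti_join_B_only.drop('_merge', axis = 1)
combined_table = TableA.merge(anti_join_B_only, how = 'outer')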
Simple!
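If this pattern is needed more than once, it could be wrapped in a small function that mirrors dplyr's `anti_join(x, y, by)`. The sketch below is one possible shape, not a pandas built-in: the name `anti_join` and its signature are hypothetical, and the sample values are made up. It merges against the key column of the second table only, so the other columns keep their names and dtypes.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` value does not occur in `right` (hypothetical helper)."""
    # Merge against the key column only, so no other columns get suffixed
    keys = right[[on]].drop_duplicates()
    merged = left.merge(keys, how='left', on=on, indicator=True)
    # 'left_only' marks rows of `left` that found no match in `right`
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')


# Illustrative data (assumed values)
TableA = pd.DataFrame({'Key': list('abcd'), 'A': [1, 2, 3, 4]})
TableB = pd.DataFrame({'Key': list('aecf'), 'A': [10, 20, 30, 40]})

filtered_B = anti_join(TableB, TableA, on='Key')
combined = pd.concat([TableA, filtered_B], ignore_index=True)
print(combined)
```

A left merge is enough here because only the rows of the first argument are of interest, and it preserves their original order.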
I had the same problem. This answer, which uses a merge with how='outer' and indicator=True, inspired me to come up with this solution:
import pandas as pd
import numpy as np
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')
TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo', '')).query(
        '_merge == "right_only"')
print('TableB_only', TableB_only, sep='\n')
Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='\n')
which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
The simplest answer imaginable:
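# Tag every key of tableB with a marker column (pd.Series(1) would only mark the first row)
tableB_marked = tableB[["key"]].assign(in_B=1)
mergedTable = tableA.merge(tableB_marked, how="left", on="key")
# Rows of tableA with a null marker have no match in tableB; swap the tables for the reverse anti-join
answer = mergedTable[mergedTable["in_B"].isnull()][tableA.columns.tolist()]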
It should also be the fastest of the proposals.