I have two PySpark DataFrames, shown below:
df1
id  city     country    region  continent
1   chicago  USA        NA      NA
2   houston  USA        NA      NA
3   Sydney   Australia  AU      AU
4   London   UK         EU      EU
df2
id  city     country  region  continent
1   chicago  USA      NA      NA
2   houston  USA      NA      NA
3   Paris    France   EU      EU
5   London   UK       EU      EU
I want to find the rows that exist in df2 but not in df1, comparing all column values. So df2 - df1 should produce df_result as shown below:
df_result
id  city    country  region  continent
3   Paris   France   EU      EU
5   London  UK       EU      EU
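How can I achieve this in PySpark? Thanks in advance.
You can use a left_anti join, which keeps only the rows of df2 that have no match in df1 on the join columns: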
df2.join(df1, on=["id", "city", "country"], how="left_anti").show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+
If all columns have non-null values, you can join on every column:
df2.join(df1, on=df2.schema.names, how="left_anti").show()