PySpark: subtract/diff PySpark DataFrames based on all columns

Cod*_*tor 3 dataframe pyspark

I have two PySpark DataFrames, as shown below:

df1

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Sydney     Australia    AU         AU
4      London     UK           EU         EU

df2

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Paris      France       EU         EU
5      London     UK           EU         EU

I want to find the rows that exist in df2 but not in df1, comparing all column values. So df2 - df1 should produce df_result as shown below:

df_result

id     city      country       region    continent
3      Paris      France       EU         EU
5      London     UK           EU         EU

How can I achieve this in PySpark? Thanks in advance.

Sur*_*ali 5

You can use a left_anti join:

df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

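If all columns have non-null values, you can join on every column (a plain equality join never matches NULL to NULL, so rows containing nulls would always be treated as different):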

df2.join(df1, on = df2.schema.names, how = "left_anti").show()