Broadcasting the left table in a join

Zyg*_*ygD 5 anti-join broadcast semi-join apache-spark pyspark

This is my join:

df = df_small.join(df_big, 'id', 'leftanti')

It seems I can only broadcast the right dataframe. But for my logic (a left anti join) to work, my df_small has to be on the left side.

How do I broadcast the dataframe on the left side?


Example:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()

df_small = spark.range(2)
df_big = spark.range(1, 5000000)

#    df_small     df_big
#    +---+        +-------+
#    | id|        |     id|
#    +---+        +-------+
#    |  0|        |      1|
#    |  1|        |      2|
#    +---+        |    ...|
#                 |4999999|
#                 +-------+

df_small = F.broadcast(df_small)
df = df_small.join(df_big, 'id', 'leftanti')
df.show()
df.explain()

#    +---+
#    | id|
#    +---+
#    |  0|
#    +---+
#
#    == Physical Plan ==
#    AdaptiveSparkPlan isFinalPlan=false
#    +- SortMergeJoin [id#197L], [id#199L], LeftAnti
#       :- Sort [id#197L ASC NULLS FIRST], false, 0
#       :  +- Exchange hashpartitioning(id#197L, 200), ENSURE_REQUIREMENTS, [id=#1406]
#       :     +- Range (0, 2, step=1, splits=2)
#       +- Sort [id#199L ASC NULLS FIRST], false, 0
#          +- Exchange hashpartitioning(id#199L, 200), ENSURE_REQUIREMENTS, [id=#1407]
#             +- Range (1, 5000000, step=1, splits=2)

Moh*_*B C 2

Unfortunately, this is not possible.

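Spark can broadcast the left-hand table only for a right outer join (and, as far as I know, for an inner join). For a left anti join, only the right-hand table can be broadcast, which is why the hint on df_small is ignored and the plan falls back to a SortMergeJoin.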
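One way to see why the side matters is a toy model in plain Python. This is a simplified sketch, not Spark's actual implementation: partitions are modelled as plain lists, and the function names `anti_join_right_broadcast` and `anti_join_left_broadcast_naive` are made up for illustration. When the right side is broadcast, every task holds all right keys and can decide for each left row on its own. When the left side is broadcast, each task sees only a slice of the right side, so "no match in my slice" does not mean "no match at all".

```python
def anti_join_right_broadcast(left_partitions, right_rows):
    """Right side broadcast: every task gets the full set of right keys."""
    right_keys = set(right_rows)  # stands in for the broadcast hash table
    out = []
    for part in left_partitions:  # each left partition is handled independently
        out.extend(k for k in part if k not in right_keys)
    return out


def anti_join_left_broadcast_naive(left_rows, right_partitions):
    """Left side broadcast: each task sees only one slice of the right side."""
    out = []
    for part in right_partitions:
        seen = set(part)
        # A task can only tell that a key is missing from ITS slice,
        # so it emits rows that do have a match in another slice.
        out.extend(k for k in left_rows if k not in seen)
    return out


left = [0, 1]
right_parts = [[1, 2], [3, 4]]
all_right = [k for part in right_parts for k in part]

correct = anti_join_right_broadcast([[0], [1]], all_right)
wrong = anti_join_left_broadcast_naive(left, right_parts)
print(correct, wrong)
```

Here `correct` is `[0]`, while `wrong` is `[0, 0, 1]`: the second task never saw key 1 on the right, so it reports it as unmatched.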

You can get the desired result by splitting the left anti join into 2 joins: an inner join followed by a left join.

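from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([1, 2, 3, 4, 5], IntegerType())
df2 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['value', 'col'])

# Inner join: collect the rows of df1 that have a match in df2.
inner = df1.join(broadcast(df2), 'value', 'inner')
# Left join against those matches; a null 'col' means the row had no match.
# This relies on 'col' never being null in df2 itself.
out = df1.join(broadcast(inner), 'value', 'left').where(col('col').isNull()).drop('col')
out.show()
#    +-----+
#    |value|
#    +-----+
#    |    3|
#    |    4|
#    |    5|
#    +-----+

# For comparison, the built-in left anti join (row order may differ):
df1.join(df2, 'value', 'left_anti').show()
#    +-----+
#    |value|
#    +-----+
#    |    5|
#    |    3|
#    |    4|
#    +-----+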