I want to join the data twice, like this:
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()
Then I get this error:
pyspark.sql.utils.AnalysisException: u'Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'
But I don't think this is a cross join.
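To make clear what I expect `res2` to contain, here is the same two-step join sketched with plain Python dicts (a hypothetical simulation, not Spark): each row of `rdd2` should pick up the `val` from `rdd1` once via `key1` and once via `key2`, giving three rows, not a cartesian product.

```python
# Plain-Python sketch of the intended result of the two joins.
idx_to_val = {1: 'a', 2: 'b', 3: 'c'}        # contents of rdd1: idx -> val
edges = [(1, 2, 1), (1, 3, 0), (2, 3, 1)]    # contents of rdd2: (key1, key2, val)

# Join each rdd2 row with rdd1 on key1, then again on key2.
expected = [
    (k1, idx_to_val[k1], k2, idx_to_val[k2], v)
    for (k1, k2, v) in edges
    if k1 in idx_to_val and k2 in idx_to_val
]
# Three matched rows, one per rdd2 row.
```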
Update:
res2.explain()
== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
: :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
: : +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
: : +- *Filter isnotnull(idx#0L)
: : …