Spark: how does salting work in dealing with skewed data

Bis*_*Ten 4 join group-by skew apache-spark apache-spark-sql

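I have skewed data in one table, which is then joined with another, smaller table. I understand that salting works for joins: a random number drawn from a fixed range is appended to the keys of the large, skewed table, and each row of the small, unskewed table is duplicated once for every number in that range. A match still occurs because, for any salted value of a skewed key, one of the duplicated rows will hit. I have also read that salting helps when performing a group by. My question is: when random numbers are appended to the key, doesn't it break the group? If it does, then the meaning of the group by operation has changed.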
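For reference, the salted join described above can be sketched roughly as follows. This is only an illustration under assumptions: the tables `big` and `small`, the column names `key`, `value`, and `label`, and the bucket count `n = 4` are all made up for the example, and it runs on a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("salted-join").getOrCreate()
    import spark.implicits._

    val n = 4 // number of salt buckets (assumed value)

    // Hypothetical data: key "a" is heavily skewed in the large table.
    val big = (Seq.fill(1000)(("a", 1)) ++ Seq(("b", 2), ("c", 3))).toDF("key", "value")
    val small = Seq(("a", "x"), ("b", "y"), ("c", "z")).toDF("key", "label")

    // Large side: every row gets one random salt in 0 until n.
    val saltedBig = big.withColumn("salt", (rand() * n).cast(IntegerType))

    // Small side: every row is replicated once per possible salt value.
    val saltedSmall = small.withColumn("salt", explode(array((0 until n).map(lit(_)): _*)))

    // Joining on (key, salt) spreads the rows of the hot key over n join keys.
    val joined = saltedBig.join(saltedSmall, Seq("key", "salt")).drop("salt")

    // Sanity check: the salted join returns as many rows as the plain join.
    assert(joined.count() == big.join(small, "key").count())

    spark.stop()
  }
}
```

Each large-side row carries exactly one salt and the small side contains every salt for every key, so each large-side row still matches exactly the rows it would have matched without salting.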

Gel*_*ion 10

My question is when random numbers are appended to the key doesn't it break the group?

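Well, it does. To mitigate this, you can run the group by operation twice: first with the salted key, then again with the salt removed. The second grouping operates on partially aggregated data, which significantly reduces the impact of the skew. This works for aggregations whose partial results can be merged, such as sum, min, and max; a count becomes a sum of partial counts in the second pass.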

E.g.

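import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Assumed: n = number of salt buckets; groupByFields and aggFields are Seq[Column]
df.withColumn("salt", (rand() * n).cast(IntegerType))
  // First pass: partial aggregation per (salt, key)
  .groupBy(col("salt") +: groupByFields: _*)
  .agg(aggFields.head, aggFields.tail: _*)
  // Second pass: salt dropped; aggFields must be able to merge the partial results
  .groupBy(groupByFields: _*)
  .agg(aggFields.head, aggFields.tail: _*)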
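
A more concrete version of the same idea, as a minimal sketch rather than a definitive implementation. The data, the column names `key` and `value`, and the bucket count `n = 8` are assumptions made up for the example. It compares the two-pass salted result with a plain group by and shows that the second pass sums the partial counts rather than counting again.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

object SaltedGroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("salted-group-by").getOrCreate()
    import spark.implicits._

    val n = 8 // number of salt buckets (assumed value)

    // Hypothetical data: key "a" dominates the table.
    val df = (Seq.fill(1000)(("a", 1L)) ++ Seq(("b", 5L), ("b", 7L))).toDF("key", "value")

    val salted = df.withColumn("salt", (rand() * n).cast(IntegerType))

    // Pass 1: the hot key is split into up to n smaller groups.
    val partial = salted
      .groupBy("key", "salt")
      .agg(sum("value").as("partial_sum"), count(lit(1)).as("partial_count"))

    // Pass 2: at most n partial rows per key remain, so skew no longer matters.
    // Partial counts are merged with sum, not counted again.
    val result = partial
      .groupBy("key")
      .agg(sum("partial_sum").as("total"), sum("partial_count").as("rows"))

    // Reference: the plain, unsalted aggregation.
    val direct = df.groupBy("key").agg(sum("value").as("total"), count(lit(1)).as("rows"))

    // Sanity check: both approaches produce the same rows.
    assert(result.except(direct).count() == 0 && direct.except(result).count() == 0)

    spark.stop()
  }
}
```

An average would need the same treatment: carry a partial sum and a partial count through the first pass and divide only after the second.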