Getting SparkException "Cannot broadcast the table that is larger than 8GB"

Lin*_*inh 2 apache-spark spark-dataframe

I am using Spark 2.2.0 for data processing. I am joining two DataFrames with Dataframe.join, but I hit this stack trace:

18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
    at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
    ...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I searched the internet for this error but found no hints or solutions for it.

Does Spark automatically broadcast a DataFrame during a join? I am quite surprised by this 8GB limit, since I would have expected DataFrames to support "big data", and 8GB is not big at all.

Thank you very much for any advice on this. Linh

Avi*_*rya 7

Currently, the requirement that a broadcast variable be smaller than 8GB is a hard limit in Spark. See here.

Generally speaking, 8GB is plenty. Consider a job running with 100 executors: the Spark driver would have to ship 8GB of data to each of the 100 nodes, producing 800GB of network traffic. A plain shuffle join, without broadcasting, costs far less than that.
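The traffic estimate above can be checked with simple arithmetic. The executor count and broadcast size below are the illustrative numbers from this answer, not values taken from the question:

```python
# Back-of-the-envelope cost of broadcasting a large table.
# All numbers are illustrative, taken from the example above.
broadcast_size_gb = 8   # size of the broadcast table (at the hard limit)
num_executors = 100     # hypothetical executor count

# The driver must ship one full copy of the table to every executor.
total_traffic_gb = broadcast_size_gb * num_executors

print(total_traffic_gb)  # 800 GB of network traffic for a single join
```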

If you really need to get around the auto-broadcast limit, you can disable automatic broadcast joins with the following configuration:

spark.sql.autoBroadcastJoinThreshold: -1
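One way to apply this setting is shown below, as a sketch assuming a PySpark session (the property name is the same from Scala or SQL; the app name is hypothetical). This is a configuration fragment and requires a working Spark installation to run:

```python
from pyspark.sql import SparkSession

# Sketch: disable automatic broadcast joins by setting the threshold to -1,
# either at session-build time...
spark = (
    SparkSession.builder
    .appName("no-auto-broadcast")  # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

# ...or on an already-running session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

With the threshold at -1, the planner falls back to a shuffle-based join (e.g. sort-merge join) instead of broadcasting either side.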


Lin*_*inh 5

After some reading, I tried disabling the automatic broadcast, and it seems to work. Change the Spark configuration with:

'spark.sql.autoBroadcastJoinThreshold': '-1'

  • I am using 'spark.sql.autoBroadcastJoinThreshold': '-1' but still get the same error. What else should I try? (2 upvotes)