Tags: apache-spark, spark-dataframe
I am using Spark 2.2.0 for data processing. I am using Dataframe.join to join two DataFrames together, but I ran into this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched the internet for this error, but found no hints or solutions on how to fix it.
Does Spark automatically broadcast DataFrames as part of a join? I am quite surprised by this 8GB limit, since I would have expected DataFrames to support "big data", and 8GB is not very big at all.
Thank you very much for any advice on this. Linh
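For context, a minimal sketch of the kind of join involved (the input paths, column name, and session setup are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-repro")
  .getOrCreate()

// Two hypothetical inputs; "small" is the side Spark tries to broadcast.
val big   = spark.read.parquet("/data/big")
val small = spark.read.parquet("/data/small")

// A plain equi-join: Spark picks the physical join strategy itself. If it
// estimates one side to be below spark.sql.autoBroadcastJoinThreshold
// (10 MB by default), it plans a broadcast hash join; a bad size estimate
// can then blow past the 8GB broadcast limit at execution time.
val joined = big.join(small, Seq("id"))
joined.write.parquet("/data/out")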
Currently, the requirement that a broadcast table be smaller than 8GB is a hard limit in Spark. See here.
Generally speaking, 8GB is big enough. Consider a job running with 100 executors: the Spark driver would need to send 8GB of data to 100 nodes, producing 800GB of network traffic. That cost is much lower if you skip the broadcast and use a plain shuffle join instead.
If you really do need to change the automatic broadcast limit, you can use the following configuration:
spark.sql.autoBroadcastJoinThreshold: -1
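A minimal sketch of applying this when building the session (the app name is illustrative):

import org.apache.spark.sql.SparkSession

// With the threshold set to -1, the planner never auto-broadcasts a join
// side and falls back to a shuffle-based join such as sort-merge join.
val spark = SparkSession.builder()
  .appName("no-auto-broadcast")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

The same value can also be passed on the command line, e.g. spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1.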
After some reading, I tried disabling auto-broadcast and it seems to work. I changed the Spark config with the following:
'spark.sql.autoBroadcastJoinThreshold': '-1'
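Since this is a runtime SQL configuration, it can also be flipped on a live session; a sketch, assuming spark is the active SparkSession:

// Disable automatic broadcast joins for this session only
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Read the value back to confirm it took effect
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))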