Spark - Executor heartbeat timed out after X ms

Question

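My program reads data from files in a directory; the files are 5 GB in size. I apply many functions to this data. I am running Spark in standalone (local) mode on a virtual machine with 32 GB of RAM.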

The command used:

bin/spark-submit --class ripeatlasanalysis.AnalyseTraceroute     --master local --driver-memory 30G  SparkExample-lowprints-0.0.5-SNAPSHOT-jar-with-dependencies.jar  1517961600  1518393600 3600

The values 1517961600 1518393600 3600 are the arguments to the jar file.

Sometimes the program runs without errors, and sometimes it fails with the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 119, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128839 ms
Driver stacktrace:
   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)

This question has already been asked here, but it received no response.

Answer 1

Mou*_*oud 8

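I could not find much information about your program, but in general this error can be caused by network problems or by a task getting stuck in a long computation. There are two things you can try. First, repartition the DataFrame you are working on into more partitions, for example df.repartition(1000), or, if you are doing a join, repartition on the join column. You can also increase spark.driver.maxResultSize.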

Second, you can increase the executor heartbeat interval and the network timeout.

--conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=10000000 --conf spark.driver.maxResultSize=4g

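Increasing spark.executor.heartbeatInterval to 10000000 is not a good idea. It means the executor will send a heartbeat only every 10000000 ms, i.e. every 166 minutes. Increasing spark.network.timeout to 166 minutes is not a good idea either: the driver would wait 166 minutes before removing an executor. The heartbeat interval should be much smaller than the network timeout. (2 upvotes)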
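
The arithmetic in the comment above, and the rule that the heartbeat interval should stay well below the network timeout, can be sketched in a few lines of Python. This is only an illustration: the helper names ms_to_minutes and timeout_conf_flags and the min_ratio threshold of 10 are assumptions made for this example, not part of Spark. Only the two property names are real Spark settings, and the values carry explicit unit suffixes so their meaning does not depend on each property's default unit.

```python
from typing import List


def ms_to_minutes(ms: int) -> float:
    """Convert a duration in milliseconds to minutes."""
    return ms / 60_000


def timeout_conf_flags(heartbeat_s: int, network_timeout_s: int,
                       min_ratio: int = 10) -> List[str]:
    """Build spark-submit --conf flags for the two timeout settings.

    Hypothetical helper: min_ratio is an illustrative threshold, not a
    Spark rule. It rejects pairs where the heartbeat interval is not
    clearly smaller than the network timeout.
    """
    if heartbeat_s * min_ratio > network_timeout_s:
        raise ValueError(
            "heartbeat interval (%ds) should be much smaller than "
            "network timeout (%ds)" % (heartbeat_s, network_timeout_s)
        )
    return [
        "--conf", "spark.executor.heartbeatInterval=%ds" % heartbeat_s,
        "--conf", "spark.network.timeout=%ds" % network_timeout_s,
    ]


if __name__ == "__main__":
    # The value from the answer: 10000000 ms is roughly 166.7 minutes.
    print(round(ms_to_minutes(10_000_000), 1))
    # A pair that keeps the heartbeat well below the timeout.
    print(" ".join(timeout_conf_flags(60, 600)))
```

With these assumptions, timeout_conf_flags(60, 600) yields flags that can be appended to the spark-submit command, while passing equal values for both settings (as in the answer) raises a ValueError. The 60s and 600s figures are example values only; suitable numbers depend on how long the tasks in the job actually run.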
