What happens when an executor is lost?

sds*_*sds 10 apache-spark

I am getting these messages:

16/05/22 13:33:53 ERROR YarnScheduler: Lost executor 61 on <host>: Executor heartbeat timed out after 134828 ms
16/05/22 13:33:53 WARN TaskSetManager: Lost task 25.0 in stage 12.0 (TID 2214, <host>): ExecutorLostFailure (executor 61 lost)

Will a replacement executor be spawned?

Yuv*_*kov 13

Will a replacement executor be spawned?

Yes, it will. Spark's DAGScheduler, together with its lower-level cluster manager implementation (Standalone, YARN, or Mesos), will notice that the tasks failed and will take care of rescheduling them as part of the overall stage being executed.

DAGScheduler

The DAGScheduler does three things in Spark (explained in more detail below):

  • It computes an execution DAG, i.e. a DAG of stages, for a job.
  • It determines the preferred locations to run each task on.
  • It handles failures due to shuffle output files being lost.

For more information, see the Advanced Spark tutorial and Mastering Apache Spark.
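As a small illustrative sketch (not part of the original answer), the retry behaviour described above can be tuned through standard Spark properties; the values below are placeholders, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch with example values: how many failures the scheduler
// tolerates before giving up is controlled by ordinary Spark properties.
val spark = SparkSession.builder()
  .appName("executor-loss-retry-demo")
  // How many times a single task may fail before the job is aborted (default 4).
  .config("spark.task.maxFailures", "8")
  // On YARN: how many executor failures are tolerated before the application fails.
  .config("spark.yarn.max.executor.failures", "16")
  .getOrCreate()
```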


Ram*_*ram 5

Yes. The tasks from the lost executor will be resubmitted and replayed on another executor; see the log below.

16/02/27 21:37:01 ERROR cluster.YarnScheduler: Lost executor 6 on ip-10-0-0-156.ec2.internal: remote Akka client disassociated
16/02/27 21:37:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-156.ec2.internal:39097] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
16/02/27 21:37:01 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 1.0
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 92), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 88), so marking it as still running
16/02/27 21:37:01 WARN scheduler.TaskSetManager: Lost task 146.0 in stage 1.0 (TID 1151, ip-10-0-0-156.ec2.internal): ExecutorLostFailure (executor 6 lost)
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 93), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 89), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 87), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 90), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 91), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 85), so marking it as still running
16/02/27 21:37:01 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
16/02/27 21:37:02 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(6, ip-10-0-0-156.ec2.internal, 34952)
16/02/27 21:37:02 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/02/27 21:37:02 INFO scheduler.Stage: Stage 1 is now unavailable on executor 6 (536/598, false)
16/02/27 21:37:17 INFO scheduler.TaskSetManager: Starting task 146.1 in stage 1.0 (TID 1152, ip-10-0-0-154.ec2.internal, RACK_LOCAL, 1396 bytes)
16/02/27 21:37:17 WARN scheduler.TaskSetManager: Lost task 123.0 in stage 1.0 (TID 1148, ip-10-0-0-154.ec2.internal): java.io.IOException: Failed to connect to ip-10-0-0-156.ec2.internal/10.0.0.156:34952
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 86), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(1, 94), so marking it as still running
16/02/27 21:37:01 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)

The fix is to increase spark.yarn.executor.memoryOverhead until the errors go away. This setting controls the buffer between the JVM heap size and the amount of memory requested from YARN (the JVM can use memory beyond its heap size). You should also make sure that yarn.nodemanager.vmem-check-enabled is set to false in the YARN NodeManager configuration; that check is a frequent cause of this error, since it lets the NodeManager kill containers based on their virtual-memory usage. If the container is genuinely running out of physical memory, make sure the JVM heap is small enough to fit inside the container.
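A minimal sketch of how these properties are typically set, with placeholder values (4g / 1024 MB are examples, not recommendations from the answer):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: tune spark.yarn.executor.memoryOverhead for your own job.
val spark = SparkSession.builder()
  .appName("memory-overhead-demo")
  // JVM heap requested for each executor.
  .config("spark.executor.memory", "4g")
  // Extra off-heap buffer requested from YARN on top of the heap, in MB
  // (newer Spark versions also accept spark.executor.memoryOverhead).
  .config("spark.yarn.executor.memoryOverhead", "1024")
  .getOrCreate()

// Note: yarn.nodemanager.vmem-check-enabled=false must be set in yarn-site.xml
// on the NodeManagers; it cannot be set through SparkConf.
```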

See the diagram below for a better understanding.

[image: YARN container memory layout] The container needs to be big enough to hold:

  • The JVM heap

  • The JVM permanent generation (PermGen)

  • Any off-heap allocations

In most cases, an overhead of 15%-30% of the JVM heap is enough. Your job configuration should include the right JVM and container settings; some jobs will need more overhead, others less.
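As a rough worked example (assumed numbers, not from the answer), sizing a container for a 4 GB heap with a 25% overhead looks like this; the max(384, ...) floor mirrors Spark's documented default overhead formula:

```scala
// Back-of-the-envelope container sizing with assumed numbers.
val heapMb      = 4096                                   // spark.executor.memory = 4g
val overheadMb  = math.max(384, (heapMb * 0.25).toInt)   // ~15%-30% of the heap
val containerMb = heapMb + overheadMb                    // what YARN must be able to grant
println(s"Request roughly $containerMb MB per executor container")  // 5120 MB
```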