Spark application kills executor

Cor*_*ave 8 apache-spark

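I'm running a Spark cluster in standalone mode and submit my application with spark-submit. In the Stages section of the Spark UI I found a running stage with a very long execution time (> 10 h, when the usual time is ~30 s). The stage has many failed tasks with the error Resubmitted (resubmitted due to lost executor). In the Aggregated Metrics by Executor section of the stage page there is an executor whose address is shown as CANNOT FIND ADDRESS. Spark keeps resubmitting these tasks indefinitely. If I kill the stage (my application automatically reruns unfinished Spark jobs), everything continues to work fine.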

I also found some strange entries in the Spark logs, written at the same time the stage started executing.

Master:

16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1

Worker:

16/11/25 10:06:05 INFO Worker: Asked to kill executor app-20161109161724-0045/1
16/11/25 10:06:08 INFO ExecutorRunner: Runner thread for executor app-20161109161724-0045/1 interrupted
16/11/25 10:06:08 INFO ExecutorRunner: Killing process!
16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
16/11/25 10:06:14 INFO Worker: Asked to launch executor app-20161109161724-0045/2 for app.jar
16/11/25 10:06:17 INFO SecurityManager: Changing view acls to: spark
16/11/25 10:06:17 INFO SecurityManager: Changing modify acls to: spark
16/11/25 10:06:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)

There is no problem with the network connection, because the worker, the master (logs above), and the driver all run on the same machine.

Spark version 1.6.1

Arm*_*aun 9

The interesting part of the log is probably this:

16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137

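Exit status 137 (128 + 9, meaning the process was terminated by SIGKILL) strongly suggests a resource issue, either memory or CPU cores. Given that you can fix the problem by rerunning the stage, it could be that all cores were somehow already allocated (maybe you also have a Spark shell running?). This is a common issue with standalone Spark setups where everything runs on one host.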
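
The arithmetic behind that exit status can be reproduced without Spark. The following is a minimal sketch, assuming a POSIX shell such as bash or dash: it starts a child shell that sends itself SIGKILL and then inspects the status the parent sees. Nothing here is Spark-specific; it only illustrates how 137 maps to signal 9.

```shell
#!/bin/sh
# Start a child shell that kills itself with SIGKILL (signal 9) and
# record the exit status the parent shell observes.
status=0
sh -c 'kill -9 $$' || status=$?

# Shells such as bash and dash report "128 + signal number" for a
# process that was terminated by a signal.
signal=$((status - 128))
echo "exit status: $status, signal number: $signal"

# kill -l translates an exit status (or signal number) into a signal name.
kill -l "$status"
```

A status of 137 only says that something sent SIGKILL; it does not say who sent it. On Linux, the kernel log (for example via `dmesg`) records when the OOM killer terminates a process, which is one way to check whether memory pressure was the cause.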

Either way, I would try the following things, in order:

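  1. Raise the storage memory fraction, spark.storage.memoryFraction, to pre-allocate more memory for storage and keep the system OOM killer from randomly handing you that 137 on a big stage.
  2. Set a lower number of cores for your application, to rule out something pre-allocating those cores before your stage runs. You can do this via spark.deploy.defaultCores; set it to 3 or even 2 (on an Intel quad-core, assuming 8 vcores).
  3. Outright allocate more RAM to Spark: spark.executor.memory needs to go up.
  4. Maybe you are running into a metadata cleanup issue here, which is also not unheard of in local deployments. In that case, appending
    export SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"
    to the end of your spark-env.sh might do the trick by forcing the metadata cleanup to run more frequently.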
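
Items 1 to 3 can be tried together from the command line. The sketch below is a dry run that only assembles and prints a spark-submit invocation, so it runs even where Spark is not installed. The master URL and all numeric values are hypothetical placeholders, and two choices are assumptions rather than part of the advice above: it caps cores per application with spark.cores.max (spark.deploy.defaultCores is a master-side default that applies only to applications that do not set spark.cores.max), and it enables spark.memory.useLegacyMode, because Spark 1.6 reads spark.storage.memoryFraction only in legacy memory mode.

```shell
#!/bin/sh
# Dry run: assemble a spark-submit command line and print it instead of running it.
# Every value below is a hypothetical placeholder; tune it for your machine.
MASTER_URL="spark://localhost:7077"  # hypothetical standalone master URL
EXECUTOR_MEMORY="4g"                 # item 3: more memory per executor
MAX_CORES="2"                        # item 2: cap the cores this application takes
STORAGE_FRACTION="0.7"               # item 1: storage fraction (Spark default is 0.6)

# Collect the arguments as positional parameters so quoting stays intact.
# app.jar is the jar name that appears in the worker log.
set -- \
  --master "$MASTER_URL" \
  --conf "spark.executor.memory=$EXECUTOR_MEMORY" \
  --conf "spark.cores.max=$MAX_CORES" \
  --conf "spark.memory.useLegacyMode=true" \
  --conf "spark.storage.memoryFraction=$STORAGE_FRACTION" \
  app.jar

cmd="spark-submit $*"
echo "$cmd"
```

Once the printed command looks right, replacing the last two lines with `spark-submit "$@"` would actually submit the job. Changing one setting at a time, in the order listed above, makes it easier to tell which one made the difference.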

In my opinion, one of these should do the trick.