Spark workers "KILLED exitStatus 143" despite ample resources for a simple computation

Bra*_*mon 2 java apache-spark kubernetes

Running Spark on Kubernetes, with 3 Spark workers that each have 8 cores and 8G of memory, results in

Executor app-xxx-xx/0 finished with state KILLED exitStatus 143

This seems to happen no matter how simple the computation is or which flags I pass to spark-submit.

For example, submitting the job gives me the following log on spark-worker-0:

21/11/15 22:07:42 INFO DriverRunner: Launch Command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx4096M" "-Dspark.master=spark://spark-master-svc:7077" "-Dspark.driver.cores=4" "-Dspark.driver.supervise=false" "-Dspark.submit.deployMode=cluster" "-Dspark.driver.memory=4g" "-Dspark.executor.memory=4g" "-Dspark.submit.pyFiles=" "-Dspark.jars=file:///opt/bitnami/spark/examples/jars/scopt_2.12-3.7.1.jar,file:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.0.jar,file:/opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.0.jar" "-Dspark.rpc.askTimeout=10s" "-Dspark.app.name=my-pi-calc-example-2" "-Dspark.executor.cores=4" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@xx.xx.19.190:34637" "/opt/bitnami/spark/work/driver-20211115220742-0006/spark-examples_2.12-3.2.0.jar" "org.apache.spark.examples.SparkPi" "3" "--verbose"
21/11/15 22:07:44 INFO Worker: Asked to launch executor app-20211115220744-0006/4 for Spark Pi
21/11/15 22:07:44 INFO SecurityManager: Changing view acls to: spark
21/11/15 22:07:44 INFO SecurityManager: Changing modify acls to: spark
21/11/15 22:07:44 INFO SecurityManager: Changing view acls groups to:
21/11/15 22:07:44 INFO SecurityManager: Changing modify acls groups to:
21/11/15 22:07:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
21/11/15 22:07:44 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx4096M" "-Dspark.driver.port=42013" "-Dspark.rpc.askTimeout=10s" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-worker-0.spark-headless.redacted.svc.cluster.local:42013" "--executor-id" "4" "--hostname" "xx.xx.19.190" "--cores" "4" "--app-id" "app-20211115220744-0006" "--worker-url" "spark://Worker@xx.xx.19.190:34637"
21/11/15 22:07:48 INFO Worker: Asked to kill executor app-20211115220744-0006/4
21/11/15 22:07:48 INFO ExecutorRunner: Runner thread for executor app-20211115220744-0006/4 interrupted
21/11/15 22:07:48 INFO ExecutorRunner: Killing process!
21/11/15 22:07:48 INFO Worker: Executor app-20211115220744-0006/4 finished with state KILLED exitStatus 143
21/11/15 22:07:48 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 4
21/11/15 22:07:48 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20211115220744-0006, execId=4)
21/11/15 22:07:48 INFO ExternalShuffleBlockResolver: Application app-20211115220744-0006 removed, cleanupLocalDirs = true
21/11/15 22:07:48 INFO Worker: Cleaning up local directories for application app-20211115220744-0006
21/11/15 22:07:48 INFO Worker: Driver driver-20211115220742-0006 exited successfully

I can remove, change, or tweak the run-example / spark-submit flags. None of it seems to have any effect, even for something as trivial as SparkPi 3; the executors are killed with exit code 143, with very little information about why they were actually killed.

Resource constraints should not be an issue here. This is a Kubernetes cluster of 3 AWS m5.4xlarge worker nodes (16 vCPU and 64 GiB RAM each) with little else actually deployed on it. I have not set any Kubernetes spec.resources limits or requests. The job is submitted as follows:

kubectl run -n redacted spark-client --rm -it --restart='Never' \
  --image docker.io/bitnami/spark:3.2.0-debian-10-r2 \
  -- run-example \
    --name my-pi-calc-example-2 \
    --master spark://spark-master-svc:7077 \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 1g \
    --driver-cores 4 \
    --executor-cores 4 \
    --verbose \
    SparkPi 3

The cluster itself is deployed with the Bitnami Spark Helm chart via ArgoCD/Helm.

The cluster deploys fine; for instance, I can see Starting Spark worker xxx.xx.xx.xx:46105 with 8 cores, 8.0 GiB RAM, and all 3 workers have joined.

What am I missing here? How can I debug this further and find out what the resource constraint actually is?


Interestingly, I can even run SparkPi locally: if I, for example, kubectl exec -it spark-worker-0 -- bash, a plain ./bin/run-example SparkPi runs without issue.

But as soon as I add the two arguments to run in cluster mode, the executor gets killed:

$ ./bin/run-example \
    --master spark://spark-master-svc:7077 \
    --deploy-mode cluster SparkPi
# Executor app-20211115222530-0008/2 finished with state KILLED exitStatus 143

Bra*_*mon 5

Learned a few things here. First, 143 KILLED does not actually seem to indicate a failure; it indicates that the executors received a signal to shut down once the job finished. So while it looks alarming when spotted in the logs, it is not.
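Exit status 143 is just the standard Unix convention of 128 + the signal number, here SIGTERM (15). This is not Spark-specific; a quick shell demonstration:

```shell
# A process terminated by SIGTERM reports exit status 128 + 15 = 143
sleep 30 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo "exit status: $?"   # prints: exit status: 143
```

So "finished with state KILLED exitStatus 143" just means the executor process was SIGTERMed, which is exactly what the master asks workers to do at shutdown.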

What tripped me up was that I never saw any "Pi is roughly 3.1475357376786883" text on stdout/stderr. That led me to believe the computation never got that far, which was incorrect.

The real issue is that I was using --deploy-mode cluster when --deploy-mode client actually makes much more sense in this situation. That is because I was running an ad-hoc container via kubectl run, which was not part of the existing deployment. That fits the definition of client mode better, since the submission does not come from an existing Spark worker. When running with --deploy-mode=cluster, you never actually see the stdout, because the application's input/output is not attached to your console.

Once I changed --deploy-mode to client, I also needed to add --conf spark.driver.host, as documented here and here, so that the pods can resolve back to the calling host.

kubectl run -n redacted spark-client --rm -it --restart='Never' \
  --image docker.io/bitnami/spark:3.2.0-debian-10-r2 \
  -- /bin/bash -c '
run-example \
  --name my-pi-calc-example \
  --master spark://spark-master-svc:7077 \
  --deploy-mode client \
  --conf spark.driver.host=$(hostname -i) \
  SparkPi 10'
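A side note on the quoting above: the whole -c string is wrapped in single quotes, so $(hostname -i) is expanded by the shell inside the pod (resolving the pod's own IP), not by the machine running kubectl. A minimal illustration of the difference:

```shell
# Double quotes: the local shell substitutes the command immediately
echo "substituted locally: $(hostname)"

# Single quotes: the text is passed through literally, to be expanded
# by whichever shell eventually evaluates it (here, the pod's shell)
echo 'passed through: $(hostname)'   # prints: passed through: $(hostname)
```

Had the outer quotes been double quotes, spark.driver.host would have been set to the IP of the submitting machine instead of the pod.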

Output:

21/11/15 23:22:16 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
21/11/15 23:22:16 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 2.961188 s
Pi is roughly 3.140959140959141
21/11/15 23:22:16 INFO SparkUI: Stopped Spark web UI at http://xx.xx.xx.xx:4040
21/11/15 23:22:16 INFO StandaloneSchedulerBackend: Shutting down all executors
21/11/15 23:22:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down

Interestingly, the Spark Master UI still shows every worker for app-20211115232213-0024 as KILLED 143, which reinforces the conclusion that this is a "normal" shutdown signal.