I am running a Spark Streaming application on a cluster of three nodes, each with one worker and three executors (so nine executors in total). I am using Spark standalone mode (version 2.1.1).
The application is launched with the spark-submit command using the options --deploy-mode client and --conf spark.streaming.stopGracefullyOnShutdown=true. The submit command is run from one of the nodes, call it node 1.
As a fault-tolerance test, I stop the worker on node 2 by calling the stop-slave.sh script.
In the executor logs on node 2 I can see several errors related to a FileNotFoundException during a shuffle operation:
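For reference, the launch described above would look roughly like the following sketch; the master URL, main class, and jar name are hypothetical placeholders, since only the two options were stated:

```shell
# Sketch of the submit command described above. Only --deploy-mode client
# and the stopGracefullyOnShutdown flag come from the post; the master URL,
# main class, and jar name are assumed placeholders.
spark-submit \
  --master spark://node1:7077 \
  --deploy-mode client \
  --conf spark.streaming.stopGracefullyOnShutdown=true \
  --class com.example.StreamingApp \
  streaming-app.jar
```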
ERROR Executor: Exception in task 5.0 in stage 5531241.0 (TID 62488319)
java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.ecb8e397-c3a3-4c1a-96ba-e153ed92b05c (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:206)
at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I can see 4 errors of this kind on the same task in each of the 3 executors on node 2.
In the driver logs I can see:
ERROR TaskSetManager: Task 5 in stage 5531241.0 failed 4 times; aborting job
...
ERROR JobScheduler: Error running job streaming job 1503995015000 ms.1
org.apache.spark.SparkException: …
I am running a Spark Streaming application on a cluster of three nodes, each with one worker and three executors (so nine executors in total). I am using Spark version 2.3.2 and the Spark standalone cluster manager.
Investigating a recent issue in which a worker machine shut down completely, I can see that the Spark Streaming job was stopped for the following reason:
18/10/08 11:53:03 ERROR TaskSetManager: Task 122 in stage 413804.1 failed 8 times; aborting job
The job was aborted because a single task in the same stage failed 8 times. This is the expected behavior.
The task in question failed for the following reason:
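The 8-attempt limit corresponds to the spark.task.maxFailures setting, whose default is 4; the log above suggests it was raised to 8 for this job, presumably via something like the following (an assumption, not shown in the post):

```shell
# spark.task.maxFailures controls how many times a single task may fail
# before the whole stage (and hence the job) is aborted. The default is 4;
# a value of 8 would explain the "failed 8 times" message above.
spark-submit \
  --conf spark.task.maxFailures=8 \
  ...
```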
18/10/08 11:53:03 INFO DAGScheduler: ShuffleMapStage 413804 (flatMapToPair at MessageReducer.java:30) failed in 3.817 s due to Job aborted due to stage failure: Task 122 in stage 413804.1 failed 8 times, most recent failure: Lost task 122.7 in stage 413804.1 (TID 223071001, 10.12.101.60, executor 1): java.lang.Exception: Could not compute split, block input-39-1539013586600 of RDD 1793044 not found
org.apache.spark.SparkException: Job aborted due to stage …