ExecutorLostFailure when running a job in Spark

Use*_*r17 9 collect apache-spark pyspark apache-spark-mllib

It throws ExecutorLostFailure every time I try to run it on this folder

Hi, I'm a beginner with Spark. I'm trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB of memory and 3.2 GB of disk. I'm submitting the Spark job from one of the 8 slave nodes (so with a storage fraction of 0.7, only about 4.8 GB is available on each node), and I'm using Mesos as the cluster manager. My configuration is:

spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true

I'm trying to run the Spark MLlib Naive Bayes algorithm on a 14 GB folder of data (there is no problem when I run the job on a 6 GB folder). I read this folder from Google Storage as an RDD, passing 32 as the partition parameter (I have also tried increasing the partitions), then create feature vectors using TF and predict based on them. But every time I try to run it on this folder it throws ExecutorLostFailure. I've tried different configurations, but nothing helps. Maybe I'm missing something very basic that I just can't figure out. Any help or suggestion would be very valuable.

The log is:

   15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled    
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s    
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)    
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.    
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s    
Traceback (most recent call last):    
  File "/opt/work/V2ProcessRecords.py", line 213, in <module>
    secondPassRDD = firstPassRDD.map(lambda ( name, title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()    
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)    
Driver stacktrace:    
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook



{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}

{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}

小智 8

In my understanding, the most common cause of ExecutorLostFailure is an OOM inside an executor.

To resolve an OOM you need to figure out what exactly is causing it. Blindly increasing the default parallelism or the executor memory is not a strategic solution.

What increasing parallelism does is try to create more tasks so that each task handles less and less data. But if your data is skewed such that the key the data is partitioned on (for parallelism) carries most of the data, simply increasing parallelism will have no effect.
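To make the skew point concrete, here is a small self-contained illustration (plain Python, not PySpark; the key names and counts are made up): under hash partitioning, a skewed key distribution leaves one giant partition no matter how many partitions you request.

```python
from collections import Counter

# Hypothetical skewed dataset: 90% of the records share one key, so hash
# partitioning routes all of them to the same partition.
records = ["hot_key"] * 9000 + ["key_%d" % i for i in range(1000)]

def largest_partition(keys, num_partitions):
    """Size of the biggest partition under hash partitioning."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return max(sizes.values())

# The biggest partition barely shrinks as parallelism grows: the 9000
# "hot_key" records always land in the same task.
for n in (32, 100, 200):
    print(n, largest_partition(records, n))
```

However many partitions are requested, the hot key's records stay in a single task, so that task's memory footprint is unchanged.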

Similarly, just increasing the executor memory is a very inefficient way to handle the situation: if only one executor is failing with ExecutorLostFailure, requesting more memory for all executors makes your application demand far more memory than it actually needs.
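A back-of-the-envelope sketch of that inefficiency (the 8 executors match the asker's setup; the 2 GB bump is a hypothetical figure):

```python
# Raising spark.executor.memory applies to every executor, so curing one
# OOM-ing executor inflates the whole application's reservation.
num_executors = 8
extra_gb_per_executor = 2  # hypothetical bump to rescue a single executor

extra_cluster_gb = num_executors * extra_gb_per_executor
print(extra_cluster_gb)  # 16 GB reserved cluster-wide to fix one node
```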

  • But how do you *find out what exactly the cause is*? And where is the solution? (2 upvotes)

小智 6

This error occurs because a task failed more than four times. Try increasing the parallelism in your cluster with the following parameter.

--conf "spark.default.parallelism=100" 

Set the parallelism value to 2 to 3 times the number of cores available in your cluster. If that doesn't work, try increasing the parallelism exponentially: if your current parallelism doesn't work, multiply it by 2, and so on. I have also observed that it can help if your parallelism is a prime number, especially when you are using groupByKey.
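As a concrete sketch (the core counts are assumptions, not stated in the question): with, say, 8 workers of 4 cores each, 2-3x the resulting 32 cores gives a parallelism of 64-96, which could be passed like this:

```shell
# Hypothetical values; tune to your own core count.
spark-submit \
  --conf "spark.default.parallelism=96" \
  V2ProcessRecords.py
```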


Arn*_*-Oz 2

Without the logs from the failed executors, rather than the driver's, it's hard to say what the problem is, but most likely it is a memory issue. Try increasing the number of partitions significantly (if you currently use 32, try 200).
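The rough arithmetic behind this suggestion, as a quick sketch (the 14 GB figure is from the question; the even-split assumption is an idealization):

```python
# With fewer partitions, each task must hold more data at once, which is
# what pressures executor memory.
data_gb = 14.0
for partitions in (32, 200):
    mb_per_partition = data_gb * 1024 / partitions
    print(partitions, round(mb_per_partition))  # ~448 MB vs ~72 MB per task
```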