Missing shuffle files

Tags: hadoop-yarn, apache-spark, apache-spark-1.5.2

I'm seeing instances of shuffle files not being written by Spark.

15/12/29 17:30:26 ERROR server.TransportRequestHandler: Error sending result 
    ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=347837678000, chunkIndex=0}, 
    buffer=FileSegmentManagedBuffer{file=/data/24/hadoop/yarn/local/usercache/root/appcache
    /application_1451416375261_0032/blockmgr-c2e951bb-856d-487f-a5be-2b3194fdfba6/1a/
    shuffle_0_35_0.data, offset=1088736267, length=8082368}} 
    to /10.7.230.74:42318; closing connection
java.io.FileNotFoundException: 
    /data/24/hadoop/yarn/local/usercache/root/appcache/application_1451416375261_0032/
    blockmgr-c2e951bb-856d-487f-a5be-2b3194fdfba6/1a/shuffle_0_35_0.data 
    (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    ...

It seems that most of the shuffle files are written successfully, just not all of them.

This happens during the shuffle stage, or more precisely the "shuffle read" stage.
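For context, any wide transformation produces a stage like this: map tasks write shuffle_*.data files under the block manager directories, and reduce tasks fetch them back over the network. A minimal sketch of such a job shape (the actual job code isn't in the question, so all the names here are hypothetical):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical job shape: a wide transformation (repartition, groupBy,
    // join, sort) forces a shuffle. Map tasks write shuffle_*.data files
    // locally; reduce tasks read them back -- the stage that fails here.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000000)
      .map(i => (i % 1000, i))
      .toDF("key", "value")

    // repartition + sort triggers a full shuffle write and a shuffle read.
    df.repartition(1000).sort("key").count()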

At first, all executors are able to read the files. Eventually, and inevitably, one of the executors throws the exception above and is removed. All the others then start failing because they can no longer retrieve those shuffle files.

[Screenshot: executor list from the Spark UI]

I have 40GB of RAM in each executor, and I have 8 executors. The extra one shown in this list is there because the executor was removed after it failed. My data is large, but I'm not seeing any out of memory problems.
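For reference, that resource setup corresponds to a configuration along these lines (a sketch only; the actual submission settings aren't shown in the question, and the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // 8 executors with 40GB each on YARN, as described above. The UI's
    // extra row is the executor that was removed after the failure.
    val conf = new SparkConf()
      .setAppName("large-shuffle-job")       // hypothetical name
      .set("spark.executor.instances", "8")  // 8 executors
      .set("spark.executor.memory", "40g")   // 40GB of RAM each
    val sc = new SparkContext(conf)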

Any ideas?


Update: I changed my repartition call from 1000 partitions to 100000 partitions, and now I'm getting a new stack trace (the change itself is sketched after the trace).

Job aborted due to stage failure: Task 71 in stage 9.0 failed 4 times, most recent failure: Lost task 71.3 in stage 9.0 (TID 2831, dev-node1): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
    at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
    at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
    at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
    at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
    at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
    at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
    at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1179)
    at org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:53)
    at org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:52)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
    at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
    at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
    at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
    at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
    ...
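For reference, the repartition change mentioned above amounts to something like this, reusing the hypothetical df from the earlier sketch (variable names are made up, since the job code isn't shown in the question):

    // Before: 1000 shuffle partitions.
    val coarse = df.repartition(1000)

    // After: 100000 partitions -- far more, far smaller shuffle blocks.
    // This is the version that now fails with FAILED_TO_UNCOMPRESS(5)
    // while reading the Snappy-compressed shuffle data back.
    val fine = df.repartition(100000)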