Spark on YARN: Container exited with a non-zero exit code 143

Question

Dav*_*d H 7 hive hadoop-yarn hortonworks-data-platform apache-spark

I am using HDP 2.5 and running spark-submit in YARN cluster mode.

I am trying to generate data using a DataFrame cross join, i.e.

val generatedData = df1.join(df2).join(df3).join(df4)
generatedData.saveAsTable(...)....

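df1 has storage level MEMORY_AND_DISK.

df2, df3, and df4 have storage level MEMORY_ONLY.

df1 has far more records, about 5 million, while df2 to df4 have at most 100 records each. With these sizes the explain plan uses BroadcastNestedLoopJoin, which I expected to give better performance.

For some reason it always fails. I don't know how to debug it or where the memory blows up.

Error log output: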

16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

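I did not see any WARN or ERROR log entries before this error. What is the problem? Where should I look for the memory consumption? I cannot see anything in the Storage tab of the Spark UI. The log was taken from the YARN ResourceManager UI on HDP 2.5.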
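
As background for reading these messages: exit status 143 follows the usual POSIX shell convention of 128 plus the signal number, so it corresponds to signal 15 (SIGTERM). In other words, the container's JVM was terminated by a signal rather than exiting by itself, which matches the "Killed by external signal" line. The status alone does not say who sent the signal or why. A minimal local sketch of the convention, with `sleep` standing in for the executor process (an assumption for illustration; nothing here touches YARN or Spark):

```shell
# Start a long-running child process in the background.
sleep 60 &
child_pid=$!

# Terminate it with SIGTERM (signal 15), as an external killer would.
kill -TERM "$child_pid"

# `wait` returns the child's exit status; for death by signal it is 128 + signal number.
status=0
wait "$child_pid" || status=$?

echo "exit status: $status"               # prints "exit status: 143"
echo "signal number: $(( status - 128 ))" # prints "signal number: 15"
```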

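EDIT: Looking at the container log, this appears to be a java.lang.OutOfMemoryError: GC overhead limit exceeded.

I know how to increase the memory, but I don't have any more memory available. How can I compute a Cartesian product of 4 DataFrames without getting this error?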

Answer 1

Mat*_*i66 6

I also ran into this problem and tried to solve it by following some blog posts.

1. Run Spark with the following conf added:

--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \
--conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC  ' \

2. When the JVM runs a GC, you will get a message like the following:

Heap after GC invocations=157 (full 98):
 PSYoungGen      total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000)
  from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000)
  to   space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000)
 ParOldGen       total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000)
  object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000)
 Metaspace       used 43044K, capacity 43310K, committed 44288K, reserved 1087488K
  class space    used 6618K, capacity 6701K, committed 6912K, reserved 1048576K  
}

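3. PSYoungGen and ParOldGen are both at 99%, so you will get java.lang.OutOfMemoryError: GC overhead limit exceeded if more objects are created.
4. If more memory resources are available, try adding more memory for the executor or driver:

--executor-memory 10000m \
--driver-memory 10000m \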
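
When sizing these flags on YARN, keep in mind that the container request is larger than the heap, because Spark adds an off-heap overhead on top of `--executor-memory`. A rough sketch of the arithmetic follows. It assumes the default overhead rule documented for Spark 1.6/2.x (the larger of 384 MB and 10% of executor memory, overridable via `spark.yarn.executor.memoryOverhead`) and ignores YARN rounding the request up to its allocation increment; check the documentation for your Spark version before relying on the numbers.

```shell
# Executor heap in MB, taken from the --executor-memory 10000m flag above.
executor_memory_mb=10000

# Assumed default overhead rule: max(384 MB, 10% of executor memory).
overhead_mb=$(( executor_memory_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then
  overhead_mb=384
fi

# Approximate memory requested from YARN per executor container.
container_request_mb=$(( executor_memory_mb + overhead_mb ))

echo "heap=${executor_memory_mb}MB overhead=${overhead_mb}MB container=${container_request_mb}MB"
```

If the resulting figure exceeds what a NodeManager can offer, the executor request cannot be satisfied regardless of how much heap the job needs.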

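5. In my case, the memory for PSYoungGen was smaller than for ParOldGen, which caused many young objects to be promoted into the ParOldGen area until ParOldGen was eventually exhausted. That is why the java.lang.OutOfMemoryError: Java heap space error appeared.
6. Add this conf for the executor: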

'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'

-XX:NewRatio=rate, where rate = ParOldGen / PSYoungGen
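
To make the ratio concrete, the sizes in the GC output from step 2 can be plugged into that formula. The sketch below is only arithmetic on the logged numbers (eden plus both survivor spaces for the young generation); with the Parallel collector's adaptive sizing the real boundaries may move, so treat the second figure as an estimate, not a guarantee.

```shell
# Sizes in KB, copied from the "Heap after GC" output above.
eden_kb=860160
from_kb=80384
to_kb=77824
old_kb=2048000

# Young generation = eden + both survivor spaces.
young_kb=$(( eden_kb + from_kb + to_kb ))

# Observed old/young ratio (integer division, so the fraction is dropped).
observed_ratio=$(( old_kb / young_kb ))
echo "young=${young_kb}K old=${old_kb}K old/young is about ${observed_ratio}"

# With -XX:NewRatio=1 and the same total, each generation gets about half.
total_kb=$(( young_kb + old_kb ))
young_with_ratio_1_kb=$(( total_kb / 2 ))
echo "with NewRatio=1: young is about ${young_with_ratio_1_kb}K"
```

The logged heap therefore sits at a ratio of about 2, and setting NewRatio=1 would grow the young generation by roughly half at the expense of the old generation.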

It depends on your workload. You can also try a different GC strategy, for example:

-XX:+UseSerialGC :Serial Collector 
-XX:+UseParallelGC :Parallel Collector
-XX:+UseParallelOldGC :Parallel Old collector 
-XX:+UseConcMarkSweepGC :Concurrent Mark Sweep

Java Concurrent and Parallel GC

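7. If steps 4 (more memory) and 6 (NewRatio) have both been done and you still get the error, you should consider changing your code. For example, reduce the number of iterations in an ML model.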

Answer 2

Abh*_*and 5

Log files for all containers and the AM are available with:

yarn logs -applicationId application_1480922439133_0845

If you only want the AM logs:

yarn logs -am -applicationId application_1480922439133_0845

If you want to find the containers that ran for this job:

yarn logs -applicationId application_1480922439133_0845|grep container_e33_1480922439133_0845_02

If you only need the log of a single container:

yarn logs -containerId container_e33_1480922439133_0845_02_000002

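For these commands to work, log aggregation must be set to true (yarn.log-aggregation-enable in yarn-site.xml); otherwise you will have to fetch the logs from the individual server directories.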
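
Once a container log is available, a quick filter can show whether an OutOfMemoryError preceded the kill. The sketch below is an illustration only: `find_oom` is a hypothetical helper name, and the sample text is a stand-in for real log output (only the error line itself comes from the question). On a cluster, the output of the `yarn logs -containerId` command above would be piped into it instead.

```shell
# Hypothetical helper: print lines mentioning an OutOfMemoryError, with line numbers.
find_oom() {
  grep -nF 'java.lang.OutOfMemoryError'
}

# Illustrative stand-in for real container log output.
sample_log='placeholder line before the error
java.lang.OutOfMemoryError: GC overhead limit exceeded
placeholder line after the error'

matches=$(printf '%s\n' "$sample_log" | find_oom)
echo "$matches"

# Against a real cluster (not executed here):
#   yarn logs -containerId container_e33_1480922439133_0845_02_000002 | find_oom
```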

Update: There is not much you can do other than trying swap, but that will degrade performance considerably.

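GC overhead limit exceeded means that the GC has been running almost continuously but has not been able to recover much memory. The only reasons for that are either that the code is poorly written and holds a large number of back references (which is doubtful here, since you are doing a simple join), or that the memory capacity has been reached.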
