We are using Spark 2.4 to process roughly 445 GB of data. Our cluster has 150 workers, each with 7 CPUs and 127 GB of memory, and Spark is deployed in standalone mode. Our configuration: one executor per worker, with 7 CPUs and 120 GB allocated to each executor. The RDD has 2000 partitions.
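Roughly, the submission looks something like the sketch below (the master host and jar name are placeholders, and setting spark.default.parallelism is just one way we could end up with ~2000 partitions; the exact command we run may differ):

spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode client \
  --executor-cores 7 \
  --executor-memory 120g \
  --conf spark.default.parallelism=2000 \
  our-job.jar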
I see that the job sometimes fails because of lost executors. Here is the error:
Driver log:
ExecutorLostFailure (executor 82 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Executor log:
2020-07-03 01:53:10 INFO Worker:54 - Executor app-20200702155258-0011/13 finished with state EXITED message Command exited with code 137 exitStatus …