What causes a SparkContext to shut down unexpectedly?

vae*_*r-k 7 hadoop-yarn apache-spark pyspark apache-spark-ml

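I have a DataFrame with 2,818,615 rows, each holding a pyspark.ml.linalg.SparseVector of length 388 and a class label. I want to train a pyspark.ml RandomForestClassifier on this dataset. Every time I try to train the model, Spark runs for about 30 minutes and then fails because the SparkContext was shut down. If I limit the dataset to only 25K rows, the model trains successfully, but I need to use the larger dataset.

What troubleshooting steps could I take here?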

print(df.rdd.getNumPartitions())   
8

df.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
+--------------------+-----+
only showing top 20 rows

My hardware:

  • Workers: 4 instances, each with 4 vCPUs and 30.5 GiB of memory
  • Master: 8 vCPUs, 16 GiB of memory

Here is how I am (trying to) train the model:
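
For a sense of scale, the figures given in the question can be combined into a quick back-of-the-envelope calculation. The following is only a sketch in plain Python: it restates numbers from the post, does not query a real cluster, and assumes the rows are spread evenly across the 8 partitions, which may not hold in practice.

```python
# Figures copied from the question; nothing here talks to Spark or YARN.
WORKER_INSTANCES = 4
VCPUS_PER_WORKER = 4
MEM_GIB_PER_WORKER = 30.5
NUM_ROWS = 2_818_615
NUM_PARTITIONS = 8  # from df.rdd.getNumPartitions()

# Aggregate worker capacity.
total_worker_vcpus = WORKER_INSTANCES * VCPUS_PER_WORKER
total_worker_mem_gib = WORKER_INSTANCES * MEM_GIB_PER_WORKER

# Assumption: rows are evenly distributed over the partitions.
rows_per_partition = NUM_ROWS / NUM_PARTITIONS

print(f"worker vCPUs in total:   {total_worker_vcpus}")
print(f"worker memory in total:  {total_worker_mem_gib} GiB")
print(f"rows per partition (avg): {rows_per_partition:,.0f}")
print(f"partitions per worker vCPU: {NUM_PARTITIONS / total_worker_vcpus}")
```

With these inputs the workers offer 16 vCPUs and 122 GiB of memory in total, while the DataFrame has only 8 partitions of roughly 352,000 rows each, so there are fewer partitions than worker cores.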

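# Assumption: SparkCV is an alias for pyspark.ml.tuning.CrossValidator.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator as SparkCV, ParamGridBuilder

rf = RandomForestClassifier(featuresCol='features', labelCol='label')
grid = ParamGridBuilder().addGrid(rf.numTrees, [30, 50, 75]).addGrid(rf.maxDepth, [10, 20]).build()
evaluator = MulticlassClassificationEvaluator(metricName="f1")
cv = SparkCV(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(df)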

The traceback says the job failed because:
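
To gauge how much work the cross-validation above asks for, the grid can be enumerated in plain Python. This is a rough sketch, not part of the original code: it assumes SparkCV behaves like pyspark.ml.tuning.CrossValidator, which fits every parameter combination once per fold and then refits the best combination on the full dataset.

```python
from itertools import product

# Values copied from the ParamGridBuilder call and numFolds in the question.
num_trees_grid = [30, 50, 75]
max_depth_grid = [10, 20]
num_folds = 3

# Every (numTrees, maxDepth) combination in the grid.
param_maps = list(product(num_trees_grid, max_depth_grid))

# One forest is trained per combination per fold.
fits_during_cv = len(param_maps) * num_folds

# Total number of individual trees grown across all of those forests.
trees_during_cv = sum(n_trees for n_trees, _ in param_maps) * num_folds

# Assumption: one extra fit of the best combination on the full data.
total_fits = fits_during_cv + 1

print(f"parameter combinations: {len(param_maps)}")
print(f"forest fits during CV:  {fits_during_cv}")
print(f"trees grown during CV:  {trees_during_cv}")
print(f"fits including refit:   {total_fits}")
```

Under those assumptions the single `cv.fit(df)` call trains 18 forests (930 trees in total, half of them with a maximum depth of 20) before the final refit, so trying one small configuration first may make it easier to see where the job starts to struggle.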

py4j.protocol.Py4JJavaError: An error occurred while calling o417.fit.
: org.apache.spark.SparkException: Job 76 cancelled because SparkContext was shut down

Here are the last few lines of the Spark log:

17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 31.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 12.
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:16:21 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 30, 18, 19.
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/11/07 23:20:07 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/11/07 23:20:07 INFO ShutdownHookManager: Shutdown hook called
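
The log tail can be summarized mechanically to make the sequence of events easier to read. The sketch below uses only the Python standard library and the log lines quoted above; the regular expressions are written against those exact message formats and are an assumption about how other lines in the full log would look.

```python
import re

# Log lines copied verbatim from the question.
LOG_TAIL = """\
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 31.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 12.
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:16:21 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 30, 18, 19.
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/11/07 23:20:07 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/11/07 23:20:07 INFO ShutdownHookManager: Shutdown hook called
"""

TOTAL_RE = re.compile(r"Driver requested a total number of (\d+) executor\(s\)")
KILL_RE = re.compile(r"Driver requested to kill executor\(s\) ([\d, ]+)\.")

requested_totals = []   # executor targets requested by the driver, in order
killed_executors = []   # executor IDs the driver asked to kill, in order
sigterm_at = None       # timestamp of the SIGTERM line, if present

for line in LOG_TAIL.splitlines():
    total_match = TOTAL_RE.search(line)
    if total_match:
        requested_totals.append(int(total_match.group(1)))
        continue
    kill_match = KILL_RE.search(line)
    if kill_match:
        killed_executors.extend(int(x) for x in kill_match.group(1).split(","))
        continue
    if "RECEIVED SIGNAL TERM" in line:
        # The first two whitespace-separated fields are the date and time.
        sigterm_at = " ".join(line.split()[:2])

print("requested executor totals:", requested_totals)
print("executors killed by driver:", killed_executors)
print("SIGTERM received at:", sigterm_at)
```

Read this way, the driver lowers its executor target from 13 to 9 and releases six executors over roughly a minute, and about four minutes later the ApplicationMaster receives SIGTERM. The log excerpt itself does not show what sent that signal, so the YARN ResourceManager and NodeManager logs for the same time window are a reasonable next place to look.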