Spark fails with "Invoking stop() from shutdown hook"


I'm having the following problem when running Spark on AWS EMR. While doing a join to filter certain IDs out of a table, Spark suddenly dies, and the stdout file reports the following:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.

The command that breaks runs fine in local mode on my machine (on a much smaller dataset), and looks like this:

# Keep only entities whose ID appears in sample_ids; the duplicate
# entity_id column coming from data_df is dropped, so the final select
# picks up sample_ids.entity_id.
sampled_data = data_df \
                .join(sample_ids, data_df.entity_id == sample_ids.entity_id, 'inner') \
                .drop(data_df.entity_id) \
                .where(data_df.subcategory == 'main') \
                .select(['entity_id', 'date', 'hour', 'pageno', 'position']) \
                .dropDuplicates()

print 'sampled_data test...'
sampled_data.take(3)
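For context, the two DataFrames are created roughly like this. This is only a sketch: the S3 paths and app name are placeholders, and the spark-csv reader shown is just illustrative of how the CSVs on S3 get loaded (the executor log below shows the job opening CSV files on s3n://):

# Illustrative setup only -- paths and names are hypothetical.
# In Spark 1.6-era PySpark, CSV files are commonly read via the
# Databricks spark-csv package ('com.databricks.spark.csv').
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='sampling-job')   # placeholder app name
sqlContext = SQLContext(sc)

# The full dataset: one DataFrame over many CSV files on S3.
data_df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('s3n://path/2016-07-01/')

# The (much smaller) set of entity IDs to sample down to.
sample_ids = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('s3n://path/sample_ids.csv')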

The complete error log (stderr) can be found here: http://pastebin.com/cUrPUQcX. I've gone over it a couple of times but can't pinpoint the problem; the failure just appears suddenly, with little information as to why:

16/07/20 21:47:08 INFO SparkContext: Invoking stop() from shutdown hook

Also, if I check the executor's log in the web UI, I see the following:

[...]
16/07/22 15:34:43 INFO s3n.S3NativeFileSystem: Opening 's3n://path/2016-07-01/data_2016-07-01T12-03-35_node7.csv' for reading
16/07/22 15:34:43 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/07/22 15:34:43 INFO storage.MemoryStore: MemoryStore cleared
16/07/22 15:34:43 INFO storage.BlockManager: BlockManager stopped
16/07/22 15:34:43 INFO s3n.S3NativeFileSystem: Opening 's3n://path/2016-07-01/data_2016-07-01T12-03-36_node5.csv' for reading
16/07/22 15:34:43 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/22 15:34:43 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/22 15:34:43 WARN executor.CoarseGrainedExecutorBackend: An unknown (ip-172-31-22-115.us-west-2.compute.internal:32836) driver disconnected.
16/07/22 15:34:43 ERROR executor.CoarseGrainedExecutorBackend: Driver 172.31.22.115:32836 disassociated! Shutting down.
16/07/22 15:34:43 INFO util.ShutdownHookManager: Shutdown hook called
16/07/22 15:34:43 INFO codegen.GenerateMutableProjection: Code generated in 23.143313 ms
16/07/22 15:34:43 INFO util.ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1469108763595_0005/spark-f9a9e3ba-1761-49d0-84b0-8711f1ca71f0

I'm also initializing the cluster with "spark.executor.memory": "10G" and 5 executors.
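For completeness, here is a sketch of how those settings translate into the Spark configuration. Only the two values quoted above are my real settings; the app name is a placeholder, and setting them via SparkConf is just one route (on EMR they can equally go into the spark-defaults classification or be passed as spark-submit flags):

# Equivalent driver-side configuration (sketch).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('sampling-job')              # placeholder app name
        .set('spark.executor.memory', '10G')     # per-executor heap
        .set('spark.executor.instances', '5'))   # number of executors
sc = SparkContext(conf=conf)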

Any advice would be appreciated.