Asked by Ale*_*del

"sparkContext被关闭",同时在大型数据集上运行spark

When running a Spark job on the cluster above a certain data size (~2.5 GB), I get either "Job cancelled because SparkContext was shut down" or "Executor lost". Looking at the YARN GUI, I can see that the job that was killed is marked as a SUCCESS. There are no problems when running on 500 MB of data. While looking for a solution I found: "it seems YARN kills some of the executors because they request more memory than expected."

Any suggestions on how to debug this?
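For reference, the quoted hint usually points at the YARN memory-overhead setting rather than the executor heap itself. A minimal sketch of raising it, assuming Spark 1.5 on YARN; the 2048 MB figure is an assumption, not a tested value:

import org.apache.spark.SparkConf

// Sketch only: give YARN more off-heap headroom per executor container.
// 2048 MB is an assumed value; the Spark 1.5 default is max(384 MB, 10% of executor memory).
val overheadConf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2048")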

The command I use to submit my Spark job:

/opt/spark-1.5.0-bin-hadoop2.4/bin/spark-submit  --driver-memory 22g --driver-cores 4 --num-executors 15 --executor-memory 6g --executor-cores 6  --class sparkTesting.Runner   --master yarn-client myJar.jar jarArguments
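A rough way to see why 6 g executors can still be killed: YARN accounts for heap plus overhead per container. A back-of-the-envelope sketch, assuming the Spark 1.5 default overhead formula:

object ContainerSizeCheck extends App {
  // What each executor container will request from YARN, assuming defaults.
  val executorMemoryMb = 6 * 1024                                        // --executor-memory 6g
  val defaultOverheadMb = math.max(384, (0.10 * executorMemoryMb).toInt) // Spark 1.5 default overhead
  val containerMb = executorMemoryMb + defaultOverheadMb
  println(s"Each executor asks YARN for roughly $containerMb MB")        // ~6758 MB here
  // If this exceeds yarn.scheduler.maximum-allocation-mb or the NodeManager's available memory,
  // YARN refuses or kills the container and Spark reports the executor as lost.
}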

And the SparkContext settings:

val sparkConf = (new SparkConf()
    .set("spark.driver.maxResultSize", "21g")
    .set("spark.akka.frameSize", "2011")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", configVar.sparkLogDir)
    )
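One thing worth flagging in these settings: spark.driver.maxResultSize is set to 21g against a 22g driver heap, so any action that actually returns that much to the driver leaves almost no headroom. A sketch of a more conservative configuration; the 4g figure is an assumption, not taken from the question:

import org.apache.spark.SparkConf

// Sketch with assumed values; keep collected results well below the driver heap.
val saferConf = new SparkConf()
  .set("spark.driver.maxResultSize", "4g")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", configVar.sparkLogDir) // configVar as defined elsewhere in the question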

A simplified version of the code that fails looks like this:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val broadcastParser = sc.broadcast(new Parser())

val featuresRdd = hc.sql("select "+ configVar.columnName + " from " + configVar.Table +" ORDER BY RAND() LIMIT " + configVar.Articles)
val myRdd : org.apache.spark.rdd.RDD[String] = featuresRdd.map(doSomething(_,broadcastParser))

val allWords= featuresRdd
  .flatMap(line => line.split(" …
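doSomething and Parser are not shown in the question; one plausible, purely illustrative shape of how the broadcast parser would be consumed inside the map closure (the parse method is assumed):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

// Illustrative only: a map function that dereferences the broadcast with .value on the executor.
def doSomething(row: Row, parser: Broadcast[Parser]): String = {
  val text = row.getString(0)   // the single column selected in the SQL above
  parser.value.parse(text)      // hypothetical Parser.parse(String): String
}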

scala hadoop-yarn apache-spark apache-spark-sql

9 votes · 2 answers · 20k views