Post by San*_*man

The spark.yarn.jars property - how do I handle it?

My knowledge of Spark is limited, as you will sense after reading this question. I have just one node, with Spark, Hadoop, and YARN installed on it.

I am able to code and run the word-count problem in cluster mode with the following command:

 spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner \
              --master yarn \
              --deploy-mode cluster \
              --driver-memory 2g \
              --executor-memory 2g \
              --executor-cores 1 \
              --num-executors 1 \
              SparkSimple-0.0.1-SNAPSHOT.jar \
              hdfs://sanjeevd.br:9000/user/spark-test/word-count/input \
              hdfs://sanjeevd.br:9000/user/spark-test/word-count/output

It works just fine.

Now, I understand that "Spark on YARN" needs the Spark jar files to be available on the cluster, and that if I do nothing, then every time I run my program it copies hundreds of jar files from $SPARK_HOME to each node (in my case, just the one node). I can see the execution of my code pause for a while until the copying finishes. See below -

16/12/12 17:24:03 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/12/12 17:24:06 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_libs__11112433502351931.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_libs__11112433502351931.zip
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_conf__6716604236006329155.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_conf__.zip
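While an application is running, the uploaded archive is visible under the staging path shown in the log above; a quick way to confirm this, assuming the same user and application id as in my log:

 hdfs dfs -ls hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/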

Spark's documentation suggests setting the spark.yarn.jars property to avoid this copying, so I set the property below in the spark-defaults.conf file:

spark.yarn.jars hdfs://sanjeevd.br:9000//user/spark/share/lib
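For what it's worth, the same documentation page describes spark.yarn.jars as a list of libraries where globs are allowed, so pointing it at a bare directory may not pick up any jars; a variant with a wildcard, assuming my HDFS layout, would be:

 spark.yarn.jars hdfs://sanjeevd.br:9000/user/spark/share/lib/*.jar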

http://spark.apache.org/docs/latest/running-on-yarn.html#preparations To make Spark runtime jars accessible from the YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
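The spark.yarn.archive alternative packs all the jars once and reuses the archive on every run; a minimal sketch of that approach, assuming the same HDFS paths as above:

 # Build an uncompressed archive of the Spark jars (the 0 flag disables compression)
 jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
 # Upload it to HDFS so YARN containers can fetch it from the distributed cache
 hdfs dfs -put spark-libs.jar /user/spark/share/lib/
 # Then, in spark-defaults.conf:
 # spark.yarn.archive hdfs://sanjeevd.br:9000/user/spark/share/lib/spark-libs.jar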

Incidentally, I do have all the jar files copied from the local /opt/spark/jars to HDFS …
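That copy itself was nothing more than a recursive put; a minimal sketch, assuming the local /opt/spark/jars directory and the HDFS path from my spark-defaults.conf:

 # Create the target directory and copy every local Spark jar into it
 hdfs dfs -mkdir -p /user/spark/share/lib
 hdfs dfs -put /opt/spark/jars/* /user/spark/share/lib/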
