Using wildcards in spark.jars when submitting an Apache Spark job

Question

I have a set of JARs stored on HDFS that I want to make available to a Spark job.

The Spark 2.3 documentation describes the spark.jars parameter as follows:

spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.

However, setting spark.jars to hdfs:///path/to/my/libs/*.jar fails: the driver starts fine and a stage is launched, but then the tasks die:

WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxxx, executor 1): java.io.FileNotFoundException: File hdfs:/path/to/my/libs/*.jar does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:901)
    at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:724)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:692)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:472)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    ...

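In other words, the glob does not appear to be expanded when the executors fetch the jars.

Explicitly setting spark.jars to hdfs:///path/to/my/libs/libA.jar,hdfs:///path/to/my/libs/libB.jar works fine.

How can I use a glob in spark.jars, as the documentation suggests?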
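
Since an explicit comma-separated list is reported to work, one possible workaround is to expand the glob on the client before submitting. The sketch below is not taken from the Spark documentation. It assumes an `hdfs` client whose `-ls` command supports the `-C` option (print paths only) and prints paths in a form Spark accepts. The function names `join_with_commas` and `list_hdfs_glob` are made up for illustration.

```shell
#!/bin/sh
# Sketch: expand an HDFS glob on the client and give Spark an explicit list.

# Join newline-separated paths read from stdin into one comma-separated line.
join_with_commas() {
    paste -s -d , -
}

# List the paths matching an HDFS glob, one per line.
# Assumption: the installed `hdfs dfs -ls` supports `-C` (paths only).
# Quote the pattern at the call site so the local shell does not expand it.
list_hdfs_glob() {
    hdfs dfs -ls -C "$1"
}

# Hypothetical usage (not executed here):
#   JARS=$(list_hdfs_glob 'hdfs:///path/to/my/libs/*.jar' | join_with_commas)
#   spark-submit --conf "spark.jars=${JARS}" ...
```

This keeps the glob in one place while Spark itself only ever sees the explicit form that the question reports as working.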

Answer 1

小智 -1

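I run all my Spark batch and streaming applications from the local file system. I'm not sure why the jars need to be stored on HDFS.

But if you prefer to keep the jars on the local file system, you can use a wildcard, like this: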

export BASE_DIR="/local/file/path/where/jar/is/available"

spark-submit \
    --class ${class} \
    --deploy-mode cluster \
    --master yarn \
...
...
...
    --name ${APPLICATION_NAME} \
    ${BASE_DIR}/*.jar
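
One caveat with the command above: the local shell expands `${BASE_DIR}/*.jar` into separate arguments, and spark-submit treats the first one as the application jar and the rest as application arguments. If the directory holds several dependency jars, a comma-separated `--jars` list may be closer to what is wanted. The following is a minimal sketch under that assumption; `jars_csv` is a hypothetical helper, not part of Spark.

```shell
#!/bin/sh
# Sketch: build a comma-separated list of the *.jar files in a local directory.

# Print every *.jar in directory $1 as one comma-separated line.
# The body runs in a subshell so `out` and `f` do not leak into the caller.
jars_csv() (
    out=""
    for f in "$1"/*.jar; do
        # Skip the unexpanded pattern when the directory contains no jars.
        [ -e "$f" ] || continue
        out="${out:+$out,}$f"
    done
    printf '%s\n' "$out"
)

# Hypothetical usage (not executed here; the jar locations are assumptions):
#   spark-submit --jars "$(jars_csv "${BASE_DIR}/libs")" ... "${BASE_DIR}/app.jar"
```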

Hope this helps.
