r_g*_*_s_ 5 pex apache-spark pyspark databricks spark-submit
我尝试使用下面的 Spark-submit 参数(依赖项位于 PEX 文件上)通过 Databricks UI(使用 Spark-submit)创建 PySpark 作业,但出现 PEX 文件不存在的异常。据我了解, --files 选项将文件放入驱动程序和每个执行程序的工作目录中,所以我很困惑为什么会遇到这个问题。
配置
[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
Run Code Online (Sandbox Code Playgroud)
标准误
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
Run Code Online (Sandbox Code Playgroud)
我尝试过的
鉴于PEX文件似乎不可见,我尝试通过以下方式添加它:
注意:鉴于本页底部的说明,我相信 PEX 应该可以在 Databricks 上运行;我只是不确定正确的配置:https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependency-in-pyspark.html
另请注意,以下 Spark 提交命令适用于 AWS EMR:
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
"spark-submit",
"--deploy-mode", "cluster",
"--master", "yarn",
"--files", "s3://some_path/my_pex.pex",
"--conf", "spark.pyspark.driver.python=./my_pex.pex",
"--conf", "spark.executorEnv.PEX_ROOT=./tmp",
"--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
"s3://some_path/main.py",
"--some_arg", "some-val"
]
Run Code Online (Sandbox Code Playgroud)
任何帮助将不胜感激,谢谢。
| 归档时间: |
|
| 查看次数: |
413 次 |
| 最近记录: |