Asked by men*_*h84 · Tags: java, oozie, apache-spark, pyspark, cloudera-quickstart-vm
I have come across several examples of SparkAction jobs in Oozie, most of them written in Java. I edited one a little and ran the example on the Cloudera CDH QuickStart 5.4.0 VM (with Spark version 1.4.0).
workflow.xml
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
job.properties
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
master=local[2]
mode=client
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark
The Oozie workflow example (in Java) is able to run and complete its task.
However, the job I have is a spark-submit job written in Python/PySpark. I tried removing the <class> tag and pointing <jar> at my script:
<jar>my_pyspark_job.py</jar>
but when I try to run the Oozie Spark job, I get an error in the logs:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]
What should I put in the <class> and <jar> tags if I am using Python/PySpark?
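For a Python job, one commonly reported approach is to omit <class> entirely and point <jar> at the .py file. The sketch below is a hedged illustration, not a confirmed fix: it assumes your Oozie sharelib includes the Spark libraries, that my_pyspark_job.py sits in the workflow's lib/ directory on HDFS, and that the pyspark and py4j zip names match your Spark distribution (the exact py4j zip name varies by Spark version).

```xml
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>${master}</master>
    <mode>${mode}</mode>
    <name>My-PySpark-Job</name>
    <!-- No <class> element: there is no main class for a Python job -->
    <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/pyspark/lib/my_pyspark_job.py</jar>
    <!-- Hypothetical zip names; copy the ones shipped with your Spark into lib/ -->
    <spark-opts>--py-files pyspark.zip,py4j-src.zip</spark-opts>
</spark>
```

Exit code [2] from SparkMain is a generic launcher failure, so checking the launcher job's stdout/stderr for the underlying Python or classpath error is usually more informative than the Oozie console message.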
I also struggled with Oozie's spark-action for a long time. I had the sharelib set up correctly and tried passing the appropriate jars via the --jars option inside the <spark-opts></spark-opts> tag, but to no avail.
I always ran into one error or another. The most I could manage was to run all of my Java/Python Spark jobs in local mode through the spark-action.
However, I do run all of my Spark jobs in Oozie, in every execution mode, using the shell action. The main problem with the shell action is that shell jobs are deployed as the 'yarn' user. If you happen to deploy your Oozie Spark job from a user account other than yarn, you end up with a Permission Denied error (because that user cannot access the Spark assembly jar copied into the /user/yarn/.sparkStaging directory). The way around this is to set the HADOOP_USER_NAME environment variable to the name of the user account that deploys the Oozie workflow.
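The effect of that environment variable can be reproduced outside Oozie from a plain shell; the sketch below mirrors what the shell action executes (the path and script name match the workflow further down, but are placeholders for your own cluster layout):

```sh
# Make HDFS operations run as 'ambari-qa', regardless of which Unix
# user the launcher container actually runs as (typically 'yarn').
export HADOOP_USER_NAME=ambari-qa

# Equivalent of the <exec>/<argument> elements in the shell action below.
/usr/hdp/current/spark-client/bin/spark-submit \
    --master yarn-cluster \
    wordcount.py
```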
Here is a workflow illustrating this configuration. I deploy my Oozie workflows as the ambari-qa user.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="sparkjob">
    <start to="spark-shell-node"/>
    <action name="spark-shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launcher2</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/ambari-qa/sparkActionPython/hive-site.xml</value>
                </property>
            </configuration>
            <exec>/usr/hdp/current/spark-client/bin/spark-submit</exec>
            <argument>--master</argument>
            <argument>yarn-cluster</argument>
            <argument>wordcount.py</argument>
            <env-var>HADOOP_USER_NAME=ambari-qa</env-var>
            <file>/user/ambari-qa/sparkActionPython/wordcount.py#wordcount.py</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="spark-fail"/>
    </action>
    <kill name="spark-fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
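For completeness, a job.properties to go with this shell-action workflow might look like the following. This is a sketch: the nameNode and jobTracker values are placeholders you must replace with your own NameNode and ResourceManager addresses, and the application path assumes the workflow lives under /user/ambari-qa/sparkActionPython as in the example above.

```properties
nameNode=hdfs://your-namenode-host:8020
jobTracker=your-resourcemanager-host:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/ambari-qa/sparkActionPython
```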
Hope this helps!
Viewed: 6044 times