Nor*_*sen 16 python apache-spark pyspark
I'm launching a pyspark program:
$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python
and the Python code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)
How do I add jar dependencies, such as the Databricks csv jar? From the command line I can add the package like this:
$ pyspark/spark-submit --packages com.databricks:spark-csv_2.10:1.3.0
But I'm not using either of those. The program is part of a larger workflow that does not use spark-submit; I should be able to run my ./foo.py program and it should just work.
Bri*_*lie 22
There are lots of approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and some of the other answers already cover them. I wanted to add an answer specifically for those using a Jupyter notebook and creating the Spark session from within the notebook. This is the solution that worked best for me (in my case I wanted the Kafka package loaded):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()
Using this line of code I didn't need to do anything else (no ENV variables or conf file changes).
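As a quick sanity check that the package actually got pulled in, the Kafka source can be used right away. This is only a sketch: the broker address localhost:9092 and the topic name 'events' are placeholders, not anything from my setup.

# Sketch: open a stream against a hypothetical 'events' topic using the
# Kafka source provided by the spark-sql-kafka package configured above.
df = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('subscribe', 'events') \
    .load()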
zer*_*323 11
Any dependency can be passed using the spark.jars.packages property (spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.
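For example, pulling in the spark-csv package from the question through spark-defaults.conf would presumably come down to a single line like this (reusing the exact coordinates quoted in the question; multiple packages would be joined with commas):

spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0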
Moreover, package or classpath properties have to be set before the JVM is started, which happens during SparkConf initialization. That means the SparkConf.set method cannot be used here.
An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:
import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
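With the context up and the package on the classpath, the csv source should be usable through an SQLContext. A rough sketch for Spark 1.x; 'cars.csv' is just a placeholder path:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# spark-csv registers the 'com.databricks.spark.csv' data source format.
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("cars.csv")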
I encountered a similar issue for a different jar ("MongoDB Connector for Spark", mongo-spark-connector), but the big caveat was that I had installed Spark via pyspark in conda (conda install pyspark). As a result, most of the Spark-specific advice in the other answers wasn't exactly applicable. For those of you installing with conda, here is the process that I cobbled together:
1) Find where your pyspark/jars are located. Mine were in this path: ~/anaconda2/pkgs/pyspark-2.3.0-py27_0/lib/python2.7/site-packages/pyspark/jars.
2) Download the jar file into the path found in step 1, from this location.
3) Now you should be able to run something like this (code taken from the MongoDB official tutorial, using Briford Wylie's answer above):
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()
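To confirm the connector is picked up, a read against the input URI configured above should work along these lines (a sketch using the data source name from the same MongoDB tutorial):

# Sketch: read the collection referenced by spark.mongodb.input.uri above.
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()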
Disclaimers:
1) I don't know if this answer is the right place/SO question to put this; please advise of a better place and I will move it.
2) If you think I have made an error or you have improvements to the process above, please comment and I will revise.