pit*_*erd 5 python apache-spark pyspark jupyter-notebook azure-hdinsight
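I am trying to run a Python word count on a Spark HDInsight cluster, and I am running it from Jupyter. I am not actually sure this is the right way to do it, but I could not find any useful information on how to submit a standalone Python application to an HDInsight Spark cluster.
The code: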
import pyspark
import operator
from pyspark import SparkConf
from pyspark import SparkContext
import atexit
from operator import add
conf = SparkConf().setMaster("yarn-client").setAppName("WC")
sc = SparkContext(conf = conf)
atexit.register(lambda: sc.stop())
input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = input.flatMap(lambda x: x.split())
wordCount = words.map(lambda x: (str(x),1)).reduceByKey(add)
wordCount.saveAsTextFile("wasb:///example/outputspark")
The error message I get, which I don't understand:
ValueError Traceback (most recent call last)
<ipython-input-2-8a9d4f2cb5e8> in <module>()
6 from operator import add
7 import atexit
----> 8 sc = SparkContext('yarn-client')
9
10 input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
/usr/hdp/current/spark-client/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
108 """
109 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110 SparkContext._ensure_initialized(self, gateway=gateway)
111 try:
112 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/hdp/current/spark-client/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
248 " created by %s at %s:%s "
249 % (currentAppName, currentMaster,
--> 250 callsite.function, callsite.file, callsite.linenum))
251 else:
252 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=yarn-client) created by __init__ at <ipython-input-1-86beedbc8a46>:7
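Note on the traceback: its last line names the cause. A SparkContext (app=pyspark-shell, master=yarn-client) was already created in an earlier notebook cell, and only one context can be active at a time. Below is a minimal sketch of one way around that, assuming the installed PySpark version provides `SparkContext.getOrCreate` (worth checking on older clusters). The helper name `get_or_create_context` is made up for illustration.

```python
def get_or_create_context(app_name="WC", master="yarn-client"):
    """Return the active SparkContext, creating one only if none is running."""
    # Imported inside the function so the helper can be defined
    # even before PySpark is importable.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    # If a context is already active (for example, one created by an
    # earlier cell), it is returned unchanged and conf is ignored.
    return SparkContext.getOrCreate(conf)
```

In a notebook this would be called as `sc = get_or_create_context("WC")`. The alternative is to call `sc.stop()` on the existing context, or restart the kernel, before constructing a new `SparkContext`.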
Is it actually possible to run a Python job this way? If so, this looks like a problem with how the SparkContext is defined. I tried different approaches:
sc = SparkContext('spark://headnodehost:7077', 'pyspark')
and
conf = SparkConf().setMaster("yarn-client").setAppName("WordCount1")
sc = SparkContext(conf = conf)
but without success. What is the right way to run the job or configure the SparkContext?
It looks like I can answer my own question. A few changes to the code turned out to help:
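import atexit
from operator import add
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("pyspark-word-count6")
sc = SparkContext(conf=conf)
# Stop the context when the interpreter exits
atexit.register(lambda: sc.stop())
data = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
# Split each line on whitespace into individual words
words = data.flatMap(lambda x: x.split())
# Drop non-ASCII characters instead of calling str(x), which can fail on them in Python 2
wordCount = words.map(lambda x: (x.encode('ascii','ignore'),1)).reduceByKey(add)
wordCount.saveAsTextFile("wasb:///output/path")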
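To check the counting logic without a cluster, the same three steps can be mimicked in plain Python. This is only a sketch of what `flatMap`, `map` and `reduceByKey(add)` compute on a small in-memory list. It is not how Spark executes them, and both the function name `word_count` and the sample lines are made up for illustration.

```python
from collections import defaultdict
from operator import add


def word_count(lines):
    """Plain-Python stand-in for flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    # flatMap: split every line on whitespace and flatten into one stream of words
    words = (word for line in lines for word in line.split())
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(add): combine the counts of identical words
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] = add(counts[word], one)
    return dict(counts)


sample = ["the quick brown fox", "the lazy dog", "the end"]  # made-up sample lines
print(word_count(sample))
```

Running this on a few lines copied from the input file gives a quick reference to compare against the part files that `saveAsTextFile` writes.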