I'm new to Spark and trying to work through a Spark tutorial: link to tutorial
After installing it on my local machine (Win10 64-bit, Python 3, Spark 2.4.0) and setting all the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple Spark job via a WordCount.py file:
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf=conf)
    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()
    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))
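For reference, here is what the pipeline is expected to compute. In plain Python (no Spark involved), `flatMap` over the split lines followed by `countByValue()` is equivalent to counting every token with `collections.Counter` — a minimal sketch, with a made-up sample line standing in for the tutorial's input file:

```python
from collections import Counter

# Sample input standing in for word_count.text (hypothetical data)
lines = ["to be or not to be"]

# flatMap(lambda line: line.split(" ")) -> one flat list of words
words = [w for line in lines for w in line.split(" ")]

# countByValue() returns a dict mapping each word to its occurrence count
wordCounts = Counter(words)

for word, count in wordCounts.items():
    print("{} : {}".format(word, count))
```

If the Spark job ran correctly, it would print the same word/count pairs for the real input file.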
After running it from the terminal:
spark-submit WordCount.py
I get the error below. I checked (by commenting out the code line by line) that it crashes at:
wordCounts = words.countByValue()
Any idea what I should check to get this working?
Traceback (most recent call last):
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module> …