I am following the advice in this document: http://spark.apache.org/docs/1.1.1/submitting-applications.html
Spark version: 1.1.0
./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \
/home/hadoop/loganalysis/ship-test.py
And the conf in the code:
from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis")
        .set("spark.executor.memory", "1g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.num", "2")
        .set("spark.driver.memory", "4g")
        .set("spark.kryoserializer.buffer.mb", "128"))
And the slave nodes complain with an ImportError:
14/12/25 05:09:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named parser
And parser-src.zip was tested locally:
[hadoop@ip-172-31-10-231 ~]$ python
Python 2.7.8 (default, Nov 3 2014, 10:17:30)
[GCC 4.8.2 20140120 …
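A local check of this kind can be sketched as follows. This is only a minimal illustration, not the session from the question: the helper names `build_demo_zip` and `can_import_from_zip` and the module name `demo_zip_mod` are hypothetical. It relies on the fact that Python can import from a zip archive placed on `sys.path`, provided the module sits at the root of the archive.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a small zip with one module at the archive root (hypothetical demo data)."""
    zip_path = os.path.join(directory, "demo-src.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_zip_mod.py", "VALUE = 42\n")
    return zip_path


def can_import_from_zip(zip_path, module_name):
    """Return True if module_name imports with zip_path temporarily on sys.path."""
    sys.path.insert(0, zip_path)
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        # Leave sys.path the way we found it.
        sys.path.remove(zip_path)
```

If the module lives in a subdirectory of the archive rather than at its root, a check like this fails locally as well, which helps separate a packaging problem from a cluster distribution problem.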
I want to install the modules 'mutagen' and 'gTTS' for my code. I want the modules to be installed on every computer that does not have them, but no install should be attempted where they are already present. I currently have:
import pip

def install(package):
    # pip.main only exists in older pip releases; it was removed in pip 10.
    pip.main(['install', package])

install('mutagen')
install('gTTS')

from gtts import gTTS
from mutagen.mp3 import MP3
However, if the modules are already installed, this adds unnecessary clutter at program startup every time the program is opened.
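One way to get the "install only if missing" behaviour is to check whether the module can be found before calling pip. The sketch below is an assumption about how this could be done, not part of the original code: `ensure_installed` and `pip_install` are hypothetical helper names, the check uses `importlib.util.find_spec` (Python 3), and pip is run as a subprocess of the current interpreter instead of through `pip.main`. It assumes `module_name` is a top-level import name.

```python
import importlib
import importlib.util
import subprocess
import sys


def pip_install(package):
    """Install a package with the pip that belongs to the running interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])


def ensure_installed(module_name, package_name=None, installer=pip_install):
    """Install package_name only if module_name cannot be found.

    module_name is the top-level import name; package_name is the name on
    the package index and defaults to module_name.
    Returns True if an install was triggered, False otherwise.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False  # already importable, nothing to do
    installer(package_name or module_name)
    # Make sure the import system notices the newly installed files.
    importlib.invalidate_caches()
    return True
```

With this helper, the startup code would call `ensure_installed("mutagen")` and `ensure_installed("gtts", "gTTS")` before the two `from ... import ...` lines. The import name and the package name differ in case for gTTS, which is why they are passed separately.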