ImportError: No module named numpy on Spark workers

ajk*_*jkl 14 python numpy apache-spark pyspark

I launch pyspark in client mode with bin/pyspark --master yarn-client --num-executors 60. On the shell, import numpy works fine, but it fails inside KMeans. My feeling is that the executors somehow don't have numpy installed. I haven't found a good way to make the workers aware of numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.

import numpy
from pyspark.mllib.clustering import KMeans, KMeansModel

# Load the precomputed feature matrix from the .npz archive.
features = numpy.load("combined_features.npz")['arr_0']
features.shape
# Distribute the rows of the matrix across 5000 partitions.
features_rdd = sc.parallelize(features, 5000)
# This is where the job fails: the executors unpickle MLlib objects,
# which import numpy on the worker side.
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10,
                        initializationMode="random")
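
A quick way to confirm the suspicion that the executors lack numpy is to probe them directly. The following is a minimal diagnostic sketch, not part of the original question; it assumes the SparkContext sc from above:

def probe(_):
    # Runs on an executor: report its host, Python binary, and numpy availability.
    import socket, sys
    try:
        import numpy
        status = numpy.__version__
    except ImportError:
        status = "numpy missing"
    yield (socket.gethostname(), sys.executable, status)

print(sorted(set(sc.parallelize(range(20), 20).mapPartitions(probe).collect())))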

Stack trace:

 org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>

ImportError: No module named numpy

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

day*_*man 18

To use Spark in Yarn client mode, you need to install any dependencies on the machines on which Yarn starts the executors. That is the only reliable way to make this work.
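
Note that PYSPARK_PYTHON only helps if it names an interpreter that actually exists, with numpy installed, at the same path on every node; that is likely why setting it did not work for the question author. A minimal sketch of wiring it through to the executors, assuming such an interpreter at /usr/bin/python2.7 (the path is an assumption for illustration):

from pyspark import SparkConf, SparkContext

# Point the executors at a numpy-capable interpreter present on every node.
conf = (SparkConf()
        .setMaster("yarn-client")
        .set("spark.executorEnv.PYSPARK_PYTHON", "/usr/bin/python2.7"))
sc = SparkContext(conf=conf)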

Using Spark in Yarn cluster mode is a different story. You can distribute Python dependencies with spark-submit:

spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py
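
The same shipping mechanism is available programmatically from a running driver via sc.addPyFile; a hedged one-liner sketch (the path is a placeholder):

# Ships a pure-Python package to every executor's import path.
sc.addPyFile("hdfs:///path/to/my_dependency.zip")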

However, numpy is complicated by the very thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed, you will not be able to distribute numpy this way.

  • Unfortunate. My team and I have been stuck on this exact problem. (5 upvotes)

sin*_*mos 5

numpy is not installed on the worker (virtual) machines. If you use Anaconda, it is very convenient to upload such Python dependencies when deploying the application in cluster mode (so there is no need to install numpy or other modules on each machine; they only need to be installed in your Anaconda environment). First, zip your Anaconda installation and put the zip file on the cluster; then you can submit the job with the following script.

 spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs://host/path/to/anaconda.zip#python-env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
  app_main.py

Yarn will copy anaconda.zip from the HDFS path to every worker and use python-env/anaconda/bin/python to execute the tasks.
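
For reference, the flags above map onto Spark configuration keys. A hedged sketch of the equivalent settings (same placeholder archive path; adding the executor-side variable as well is a common precaution, not something the original answer includes):

from pyspark import SparkConf

conf = (SparkConf()
        # Ship and unpack the archive on every container under the alias python-env.
        .set("spark.yarn.dist.archives", "hdfs://host/path/to/anaconda.zip#python-env")
        # Interpreter for the application master (the cluster-mode driver).
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "python-env/anaconda/bin/python")
        # Interpreter for the executors' Python workers.
        .set("spark.executorEnv.PYSPARK_PYTHON", "python-env/anaconda/bin/python"))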

See Running PySpark with Virtualenv, which may provide more information.