How do I save a numpy array from a PySpark worker to HDFS or a shared file system?

Question

Bik*_*shi 5 hadoop hdfs shared-file apache-spark pyspark

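I want to efficiently save numpy arrays to HDFS, and read them back, from inside worker functions in PySpark. I have two machines, A and B. A runs the master and one worker. B runs one worker. For example, I want to achieve something like the following: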

from pyspark import SparkConf, SparkContext

# func is defined before the main block so that it exists when it is used
def func(iterator):
    P = << LOAD from HDFS or Shared Memory as numpy array >>
    for x in iterator:
        P = P + x

    << SAVE P (numpy array) to HDFS / shared file system >>

if __name__ == "__main__":
    conf = SparkConf().setMaster("local").setAppName("Test")
    sc = SparkContext(conf=conf)
    sc.parallelize([0, 1, 2, 3], 2).foreachPartition(func)

What is a fast and efficient way to do this?

Answer 1

Der*_*lin 1

I stumbled upon the same problem. I ended up using the HdfsCLI module together with Python 3.4's temporary files to work around it.

Imports:

import numpy
from hdfs import InsecureClient
from tempfile import TemporaryFile

Create an HDFS client. In most cases it is best to have a utility function somewhere in your script, like this one:

def get_hdfs_client():
    return InsecureClient("<your webhdfs uri>", user="<hdfs user>",
         root="<hdfs base path>")

Loading and saving a numpy array inside the worker function:

hdfs_client = get_hdfs_client()

# load from file.npy
path = "/whatever/hdfs/file.npy"
tf = TemporaryFile()

with hdfs_client.read(path) as reader:
    tf.write(reader.read())
    tf.seek(0) # important, set cursor to beginning of file

np_array = numpy.load(tf)

...

# save to file.npy
tf = TemporaryFile()
numpy.save(tf, np_array)
tf.seek(0) # important ! set the cursor to the beginning of the file
# with overwrite=False, an exception is thrown if the file already exists
hdfs_client.write("/whatever/output/file.npy", tf.read(),  overwrite=True)
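
As a variation on the snippet above, the temporary file can be swapped for an in-memory buffer. The following is only a sketch: the helper names (`array_to_bytes`, `bytes_to_array`, `load_array`, `save_array`) are hypothetical, and `client` is assumed to be the HdfsCLI client returned by `get_hdfs_client()`, used with the same `read` and `write` calls as in the answer.

```python
import io

import numpy


def array_to_bytes(arr):
    """Serialize a numpy array to .npy-format bytes without touching disk."""
    buf = io.BytesIO()
    numpy.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(data):
    """Rebuild a numpy array from .npy-format bytes."""
    return numpy.load(io.BytesIO(data), allow_pickle=False)


def load_array(client, path):
    # Assumption: client.read(path) is a context manager, as in the answer.
    with client.read(path) as reader:
        return bytes_to_array(reader.read())


def save_array(client, path, arr):
    # Assumption: client.write accepts bytes and overwrite=True, as in the answer.
    client.write(path, array_to_bytes(arr), overwrite=True)
```

This keeps the whole serialized array in memory, so it is likely a better fit for small or medium arrays than for ones close to the worker's memory limit.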

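Notes:

The URI used to create the HDFS client starts with http://, because it goes through the web interface (WebHDFS) of the HDFS file system;
Make sure the user you pass to the HDFS client has read and write permissions;
In my experience the overhead is not significant (at least in terms of execution time);
The advantage of using a temporary file (compared with a regular file in /tmp) is that no junk files are left behind on the cluster machines once the script ends, whether it ends normally or not.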
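
The cleanup behaviour mentioned in the last note can be made explicit by using the temporary file as a context manager, which closes (and thereby deletes) it as soon as the block exits. This is a minimal local sketch with no HDFS involved; the function name `roundtrip_via_tempfile` is hypothetical and only serves to illustrate the save, seek, load sequence.

```python
import numpy
from tempfile import TemporaryFile


def roundtrip_via_tempfile(arr):
    """Save arr to a temporary file and load it back again."""
    with TemporaryFile() as tf:
        numpy.save(tf, arr)
        tf.seek(0)  # rewind before reading, as in the answer
        # The array is fully read into memory before the file is closed.
        return numpy.load(tf)
```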

Asked:	10 years, 3 months ago
Viewed:	3,644 times
Last active:	9 years, 9 months ago