Bik*_*shi 5 hadoop hdfs shared-file apache-spark pyspark
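I want to efficiently save/read numpy arrays to/from HDFS from inside the worker machines (functions) in PySpark. I have two machines, A and B. A hosts the master and one worker. B hosts one worker. For example, I would like to achieve something like the following: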
from pyspark import SparkConf, SparkContext

# Defined before the main block so the name exists when foreachPartition is called
def func(iterator):
    P = << LOAD from HDFS or shared memory as numpy array >>
    for x in iterator:
        P = P + x
    << SAVE P (numpy array) to HDFS / shared file system >>

if __name__ == "__main__":
    conf = SparkConf().setMaster("local").setAppName("Test")
    sc = SparkContext(conf=conf)
    sc.parallelize([0, 1, 2, 3], 2).foreachPartition(func)
What is a fast and efficient way to do this?
I stumbled upon the same problem. I ended up using the HdfsCLI module and temporary files (Python 3.4) to work around it.
import numpy
from hdfs import InsecureClient
from tempfile import TemporaryFile
def get_hdfs_client():
    return InsecureClient("<your webhdfs uri>", user="<hdfs user>",
                          root="<hdfs base path>")
hdfs_client = get_hdfs_client()
# load from file.npy
path = "/whatever/hdfs/file.npy"
tf = TemporaryFile()
with hdfs_client.read(path) as reader:
    tf.write(reader.read())
tf.seek(0) # important, set cursor to beginning of file
np_array = numpy.load(tf)
...
# save to file.npy
tf = TemporaryFile()
numpy.save(tf, np_array)
tf.seek(0) # important ! set the cursor to the beginning of the file
# with overwrite=False, an exception is thrown if the file already exists
hdfs_client.write("/whatever/output/file.npy", tf.read(), overwrite=True)
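Notes:
the WebHDFS URI starts with http://, because it uses the web interface of the HDFS file system;
the advantage of temporary files over regular files in /tmp is that no garbage files are left on the cluster machines after the script ends, whether it ends normally or not.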
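For what it's worth, the same round trip can probably be done without touching the local disk at all, by serializing through an in-memory buffer. The sketch below is only an illustration under a few assumptions: `load_array`, `save_array` and `add_partition` are hypothetical helper names, and `InMemoryClient` is a made-up stand-in that mimics only the two HdfsCLI calls used in the answer (`read` as a context manager and `write(path, data, overwrite=...)`), so the example runs without a cluster. With a real cluster you would pass the `InsecureClient` from `get_hdfs_client()` instead.

```python
import io
from contextlib import contextmanager

import numpy as np


def load_array(client, path):
    """Read a .npy file through the client and decode it from memory."""
    with client.read(path) as reader:
        buf = io.BytesIO(reader.read())
    return np.load(buf)


def save_array(client, path, array):
    """Encode the array as .npy in memory and write the bytes through the client."""
    buf = io.BytesIO()
    np.save(buf, array)
    client.write(path, data=buf.getvalue(), overwrite=True)


def add_partition(iterator, client, in_path, out_path):
    """Hypothetical per-partition worker, shaped like func in the question."""
    P = load_array(client, in_path)
    for x in iterator:
        P = P + x
    save_array(client, out_path, P)


class InMemoryClient:
    """Stand-in for an HDFS client (assumption: only read/write are mimicked)."""

    def __init__(self):
        self.files = {}

    @contextmanager
    def read(self, path):
        yield io.BytesIO(self.files[path])

    def write(self, path, data, overwrite=False):
        if path in self.files and not overwrite:
            raise FileExistsError(path)
        self.files[path] = bytes(data)


if __name__ == "__main__":
    client = InMemoryClient()
    save_array(client, "/whatever/hdfs/file.npy", np.zeros(3))
    add_partition(iter([0, 1]), client,
                  "/whatever/hdfs/file.npy", "/whatever/output/file.npy")
    print(load_array(client, "/whatever/output/file.npy"))
```

In a Spark job, one option would be to create the client inside the partition function, so that each worker opens its own connection, and to give each partition its own output path so that partitions do not overwrite each other's results.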