Posts by jon*_*han

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.

In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).

In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
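
As a rough sanity check on how the two setups compare, the per-worker load can be estimated with some simple arithmetic. This is only a sketch: the helper name `per_worker_load` is hypothetical, and the 128 MB block size is an assumed value (a common HDFS default), not something taken from the guide or from this cluster's configuration.

```python
import math


def per_worker_load(file_gb, workers, nodes, block_mb=128):
    """Return rough sizing figures for spreading one file over a cluster.

    block_mb is an assumption (128 MB is a common HDFS default);
    1 GB is treated as 1024 MB.
    """
    blocks = math.ceil(file_gb * 1024 / block_mb)
    return {
        "workers_per_node": workers // nodes,
        "gb_per_worker": file_gb / workers,
        "blocks": blocks,
        "blocks_per_worker": blocks / workers,
    }


# Figures quoted in the post: the guide's setup and this cluster.
guide = per_worker_load(file_gb=20, workers=64, nodes=8)
mine = per_worker_load(file_gb=82, workers=288, nodes=12)

for name, stats in (("guide", guide), ("this cluster", mine)):
    print(name, stats)
```

Under these assumptions both setups land at roughly 0.3 GB of CSV per worker, so the data volume per worker is comparable even though this cluster runs three times as many workers per node (24 versus 8).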

On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:

import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv'))) …

python hdfs dask dask-distributed

5 votes · 1 answer · 2509 views
