I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
Matthew's example uses a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes, each also running an HDFS data node).
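My benchmark follows the same read-and-persist pattern as the post. A minimal sketch of that pattern (the scheduler address and HDFS path are placeholders; Dask resolves hdfs:// URLs through its pyarrow/hdfs3 backends):

from dask.distributed import Client, wait
import dask.dataframe as dd

client = Client('scheduler-host:8786')  # placeholder scheduler address

# Read the CSV from HDFS into a Dask DataFrame, persist it across
# the workers, and block until the load finishes.
df = dd.read_csv('hdfs:///path/to/file.csv')
df = client.persist(df)
wait(df)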
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa

fs = pa.hdfs.connect([url], 8022)  # [url] is the namenode host
print(fs.info('/path/to/file.csv'))
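To confirm that the worker processes themselves (not just a locally run script) can reach HDFS, the same check can be pushed to every worker with Client.run. A sketch, using the same placeholder hostnames as above:

from dask.distributed import Client
import pyarrow as pa

client = Client('scheduler-host:8786')  # placeholder scheduler address

def check_hdfs():
    # Runs inside each worker process; mirrors the local check above.
    fs = pa.hdfs.connect('namenode-host', 8022)  # placeholder namenode host
    return str(fs.info('/path/to/file.csv'))

# Client.run executes the function on every worker and returns a dict
# mapping worker address -> result.
print(client.run(check_hdfs))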