为什么在使用 dask 时 zarr 的性能比镶木地板好得多？

Question

为什么在使用 dask 时 zarr 的性能比镶木地板好得多？

Chr*_*ine 5 python dask intake

当我使用 dask 对 zarr 数据和 parquet 数据运行基本相同的计算时，基于 zarr 的计算明显更快。为什么？可能是因为我在创建镶木地板文件时做错了什么？

我在 jupyter notebook 中用虚假数据（见下文）复制了这个问题，以说明我所看到的行为类型。我很感激任何人对为什么基于 zarr 的计算比基于镶木地板的计算快几个数量级的见解。

我在现实生活中使用的数据是地球科学模型数据。特定的数据参数并不重要，但可以将每个参数视为具有纬度、经度和时间维度的数组。

要生成 zarr 文件，我只需写出我的参数及其维度的多维结构。

为了生成镶木地板，我首先将 3-D 参数数组“展平”为一个 1-D 数组，它成为我数据框中的单列。然后我添加纬度、经度和时间列，然后将数据框写为镶木地板。

此单元格具有其余代码所需的所有导入：

import pandas as pd
import numpy as np
import xarray as xr
import dask
import dask.array as da
import intake
from textwrap import dedent

Run Code Online (Sandbox Code Playgroud)

该单元生成假数据文件，总大小超过 3 GB：

def build_data(lat_resolution, lon_resolution, ntimes):
    """Build a fake geographical dataset with ntimes time steps and 
       resolution lat_resolution x lon_resolution"""
    lats = np.linspace(-90.0+lat_resolution/2,
                       90.0-lat_resolution/2,
                       np.round(180/lat_resolution))
    lons = np.linspace(-180.0+lon_resolution/2,
                       180-lon_resolution/2,
                       np.round(360/lon_resolution))
    times = np.arange(start=1,stop=ntimes+1)

    data = np.random.randn(len(lats),len(lons),len(times))
    return lats,lons,times,data

def create_zarr_from_data_set(lats,lons,times,data,zarr_dir):
    """Write zarr from a data set corresponding to the data passed in."""
    dar = xr.DataArray(data,
                       dims=('lat','lon','time'),
                       coords={'lat':lats,'lon':lons,'time':times},
                       name="data")
    ds = xr.Dataset({'data':dar,
                     'lat':('lat',lats),
                     'lon':('lon',lons),
                     'time':('time',times)})
    ds.to_zarr(zarr_dir)

def create_parquet_from_data_frame(lats,lons,times,data,parquet_file):
    """Write a parquet file from a dataframe corresponding to the data passed in."""
    total_points = len(lats)*len(lons)*len(times)

    # Flatten the data array
    data_flat = np.reshape(data,(total_points,1))

    # use meshgrid to create the corresponding latitude, longitude, and time 
    # columns
    mesh = np.meshgrid(lats,lons,times,indexing='ij')
    lats_flat = np.reshape(mesh[0],(total_points,1))
    lons_flat = np.reshape(mesh[1],(total_points,1))
    times_flat = np.reshape(mesh[2],(total_points,1))

    df = pd.DataFrame(data = np.concatenate((lats_flat,
                                             lons_flat,
                                             times_flat, 
                                             data_flat),axis=1), 
                      columns = ["lat","lon","time","data"])
    df.to_parquet(parquet_file,engine="fastparquet")

def create_fake_data_files():
    """Create zarr and parquet files with fake data"""
    zarr_dir = "zarr"
    parquet_file = "data.parquet"

    lats,lons,times,data = build_data(0.1,0.1,31)
    create_zarr_from_data_set(lats,lons,times,data,zarr_dir)
    create_parquet_from_data_frame(lats,lons,times,data,parquet_file)

    with open("data_catalog.yaml",'w') as f:
        catalog_str = dedent("""\
            sources:
              zarr:
                args:
                  urlpath: "./{}"
                description: "data in zarr format"
                driver: intake_xarray.xzarr.ZarrSource
                metadata: {{}}
              parquet:
                args:
                  urlpath: "./{}"
                description: "data in parquet format"
                driver: parquet
        """.format(zarr_dir,parquet_file))
        f.write(catalog_str)


##
# Generate the fake data
##
create_fake_data_files()

Run Code Online (Sandbox Code Playgroud)

我对 parquet 和 zarr 文件运行了几种不同类型的计算，但在本示例中为简单起见，我将仅提取特定时间、纬度和经度的单个参数值。

此单元格构建 zarr 和 parquet 有向无环图 (DAG) 以进行计算：

# pick some arbitrary point to pull out of the data
lat_value = -0.05
lon_value = 10.95
time_value = 5

# open the data
cat = intake.open_catalog("data_catalog.yaml")
data_zarr = cat.zarr.to_dask()
data_df = cat.parquet.to_dask()

# build the DAG for getting a single point out of the zarr data
time_subset = data_zarr.where(data_zarr.time==time_value,drop=True)
lat_condition = da.logical_and(time_subset.lat < lat_value + 1e-9, time_subset.lat > lat_value - 1e-9)
lon_condition = da.logical_and(time_subset.lon < lon_value + 1e-9, time_subset.lon > lon_value - 1e-9)
geo_condition = da.logical_and(lat_condition,lon_condition)
zarr_subset = time_subset.where(geo_condition,drop=True)

# build the DAG for getting a single point out of the parquet data
parquet_subset = data_df[(data_df.lat > lat_value - 1e-9) & 
                         (data_df.lat < lat_value + 1e-9) &
                         (data_df.lon > lon_value - 1e-9) & 
                         (data_df.lon < lon_value + 1e-9) &
                         (data_df.time == time_value)]

Run Code Online (Sandbox Code Playgroud)

当我针对每个 DAG 的计算运行时间时，我得到了截然不同的时间。基于 zarr 的子集需要不到一秒钟的时间。基于镶木地板的子集需要 15-30 秒。

此单元格执行基于 zarr 的计算：

%%time
zarr_point = zarr_subset.compute()

Run Code Online (Sandbox Code Playgroud)

基于 Zarr 的计算时间：

CPU times: user 6.19 ms, sys: 5.49 ms, total: 11.7 ms
Wall time: 12.8 ms

Run Code Online (Sandbox Code Playgroud)

此单元格执行基于镶木地板的计算：

CPU times: user 6.19 ms, sys: 5.49 ms, total: 11.7 ms
Wall time: 12.8 ms

Run Code Online (Sandbox Code Playgroud)

基于 Parquet 的计算时间：

CPU times: user 18.2 s, sys: 28.1 s, total: 46.2 s
Wall time: 29.3 s

Run Code Online (Sandbox Code Playgroud)

如您所见，基于 zarr 的计算要快得多。为什么？

Answer 1

mdu*_*ant 5

很高兴看到fastparquet，zarr并intake用于同一问题！

TL;DR 是：使用适合您的任务的正确数据模型。

另外，值得指出的是，zarr 数据集为 1.5GB，blosc/lz4 压缩为 512 个块，parquet 数据集 1.8GB，snappy 压缩为 5 个块，其中压缩都是默认值。随机数据不能很好地压缩，坐标可以。

zarr 是一种面向数组的格式，可以在任何维度上分块，这意味着，要读取单个点，您只需要元数据（这是非常简短的文本）和包含它的一个块 - 这需要在这种情况下未压缩。数据块的索引是隐式的。

parquet 是一种面向列的格式。要找到特定点，您可以根据每个块的最小/最大列元数据忽略一些块，具体取决于坐标列的组织方式，然后加载随机数据的列块并解压缩。您需要自定义逻辑才能选择块以同时加载到多个列上，Dask 当前未实现（如果不仔细重新排序数据，则不可能实现）。parquet 的元数据比 zarr 大得多，但在这种情况下两者都无关紧要 - 如果您有许多变量或更多坐标，这可能成为 parquet 的一个额外问题。

在这种情况下，zarr 的随机访问速度会快得多，但读取所有数据并没有根本不同，因为两者都必须加载磁盘上的所有字节并解压缩为浮点数，并且在这两种情况下，坐标数据加载速度都很快。然而，未压缩数据帧的内存表示比未压缩数组大得多，因为每个坐标的数组不是一维小数组，而是每个坐标的数组，其点数与随机数据相同；另外，通过对小数组进行索引以获得数组案例中的正确坐标，并通过与数据帧案例中每个点的每个纬度/经度值进行比较，来找到特定点。

归档时间：	6 年，3 月前
查看次数：	1166 次
最近记录：	6 年，3 月前