Optimal HDF5 dataset chunk shape for reading rows

Question

jpp*_*jpp 2 python performance hdf5 dataset h5py

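I have a reasonably large HDF5 dataset (18 GB compressed) and want to optimize the speed of reading rows. Its shape is (639038, 10000). I will repeatedly read a selection of rows (say ~1,000 rows) scattered across the whole dataset, so I cannot slice rows with x:(x+1000).

Reading rows from an out-of-memory HDF5 file with h5py is already slow, because I have to pass a sorted list and resort to fancy indexing. Is there a way to avoid fancy indexing, or is there a better chunk shape/size I could use?

I have read rules of thumb such as a 1 MB to 10 MB chunk size and choosing a chunk shape consistent with what I read. However, building a large number of HDF5 files with different chunk shapes for testing is computationally expensive and very slow.

For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this: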

'10000': {'chunks': (64, 1000),
          'compression': 'lzf',
          'compression_opts': None,
          'dtype': dtype('float32'),
          'fillvalue': 0.0,
          'maxshape': (None, 10000),
          'shape': (639038, 10000),
          'shuffle': False,
          'size': 2095412704}

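For reference, the access pattern described above (select scattered rows, then sum them) might look roughly like the sketch below. The helper name `sum_selected_rows` is hypothetical, and a small NumPy array stands in for the h5py dataset so the snippet runs on its own; it assumes the selection contains no repeated rows, since h5py's fancy indexing expects indices in increasing order.

```python
import numpy as np

def sum_selected_rows(dset, row_ids):
    """Sum the selected rows of a 2-D dataset into one 1-D array.

    `dset` is anything that supports dset[sorted_indices, :], for
    example a NumPy array or (by assumption) an h5py dataset.
    """
    # np.unique returns the indices sorted and without duplicates
    rows = np.unique(row_ids)
    # One fancy-indexed read, then a reduction over the row axis
    return dset[rows, :].sum(axis=0)

# Stand-in data: 5 rows x 4 columns instead of 639038 x 10000
data = np.arange(20, dtype=np.float32).reshape(5, 4)
total = sum_selected_rows(data, [3, 0])
print(total)
```

With a real file, `data` would be replaced by the opened dataset object; the rest of the function stays the same.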

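Things I have already tried:

Rewriting the dataset with chunk shape (128, 10000), which I estimate at about 5 MB per chunk, is prohibitively slow.
I looked at dask.array for optimization, but since ~1,000 rows easily fit in memory I saw no benefit.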
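To compare candidate chunk shapes without rebuilding the file, the amount of data that has to be loaded per row can be estimated with plain arithmetic. This is a rough sketch: the function name `bytes_per_row_read` is made up for illustration, and it counts uncompressed bytes for float32 data, ignoring compression and caching.

```python
import math

def bytes_per_row_read(chunk_shape, n_cols, itemsize=4):
    """Uncompressed bytes that must be loaded to return one full row.

    Chunks are always read as a whole, so one row costs every chunk
    that the row passes through.
    """
    chunk_rows, chunk_cols = chunk_shape
    chunks_per_row = math.ceil(n_cols / chunk_cols)
    chunk_bytes = chunk_rows * chunk_cols * itemsize
    return chunks_per_row * chunk_bytes

for chunk_shape in [(64, 1000), (128, 10000)]:
    chunk_mb = chunk_shape[0] * chunk_shape[1] * 4 / 1e6
    row_mb = bytes_per_row_read(chunk_shape, n_cols=10000) / 1e6
    print(chunk_shape, "chunk MB:", chunk_mb, "MB loaded per row:", row_mb)
```

Under these assumptions the current (64, 1000) layout loads about 2.56 MB per row and (128, 10000) loads about 5.12 MB per row, so 1,000 scattered rows that share no chunks cost roughly 1,000 times that amount.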

Answer 1

max*_*111 7

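Finding the right chunk cache size

First, I want to discuss some general points. It is very important to know that each individual chunk can only be read or written as a whole. The chunk cache, which can avoid excessive disk I/O, defaults to only 1 MB in h5py and should be increased in many cases, as discussed below.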

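As an example:

We have a dset with shape (639038, 10000), float32 (25.5 GB uncompressed)
We want to write our data column-wise, dset[:,i]=arr, and read it row-wise, arr=dset[i,:]
We choose a completely wrong chunk shape for this type of work, i.e. (1, 10000)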

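In this case the read speed won't be too bad (although the chunk size is a little small), because we read only the data we actually use. But what happens when we write to that dataset? If we access a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration, and also reading the whole dataset each time. This is because a chunk that is modified has to be read first if it is not cached (I assume a chunk cache size below 25.5 GB here).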
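The cost of a single column write under this layout can be checked with a few lines of arithmetic. The sketch below is only an estimate: `column_write_cost` is a hypothetical helper, it counts uncompressed float32 bytes, and it assumes no chunk stays in the cache between iterations.

```python
import math

def column_write_cost(shape, chunk_shape, itemsize=4):
    """Chunks touched and bytes rewritten when writing one column.

    Every chunk that the column crosses has to be rewritten as a whole.
    """
    n_rows, _ = shape
    chunk_rows, chunk_cols = chunk_shape
    chunks_touched = math.ceil(n_rows / chunk_rows)
    chunk_bytes = chunk_rows * chunk_cols * itemsize
    return chunks_touched, chunks_touched * chunk_bytes

chunks, nbytes = column_write_cost((639038, 10000), (1, 10000))
payload = 639038 * 4  # bytes of useful data in one float32 column
print(chunks, "chunks,", nbytes / 1e9, "GB rewritten for", payload / 1e6, "MB of data")
```

With (1, 10000) chunks every one of the 639038 chunks is touched, so each column write moves the full dataset for about 2.6 MB of new values.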

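So what can we improve here? In such a case we have to make a compromise between write/read speed and the memory used by the chunk cache.

An assumption that gives both decent read and write speed:

We choose a chunk shape of (100, 1000)
If we want to iterate over the first dimension (writing a column), we need a cache of at least 1000 * 639038 * 4 bytes -> 2.56 GB to avoid the additional I/O overhead described above; iterating over the second dimension (reading a row) needs 100 * 10000 * 4 bytes -> 4 MB.
So in this example we should provide a chunk cache of at least 2.6 GB. This can be done easily with h5py-cache https://pypi.python.org/pypi/h5py-cache/1.0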
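The two cache figures above follow from the chunk shape alone. A minimal sketch of that calculation, with the hypothetical helper `cache_bytes_needed` and uncompressed float32 data assumed:

```python
import math

def cache_bytes_needed(shape, chunk_shape, itemsize=4):
    """Cache needed to hold all chunks crossed by one column or one row."""
    n_rows, n_cols = shape
    chunk_rows, chunk_cols = chunk_shape
    chunk_bytes = chunk_rows * chunk_cols * itemsize
    # A column crosses every chunk along the first dimension
    column_pass = math.ceil(n_rows / chunk_rows) * chunk_bytes
    # A row crosses every chunk along the second dimension
    row_pass = math.ceil(n_cols / chunk_cols) * chunk_bytes
    return column_pass, row_pass

column_pass, row_pass = cache_bytes_needed((639038, 10000), (100, 1000))
print(column_pass / 1e9, "GB for a column pass")
print(row_pass / 1e6, "MB for a row pass")
```

The column pass comes out slightly above 1000 * 639038 * 4 because the last, partially filled chunk still occupies a full chunk in the cache. As a side note, h5py 2.9 and later accept an `rdcc_nbytes` keyword on `h5py.File`, which should allow the same cache size to be set without the separate h5py-cache package.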

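Conclusion: there is no generally right chunk size or shape; it depends heavily on the task. Never choose your chunk size or shape without also thinking about the chunk cache. For random reads and writes, RAM is orders of magnitude faster than the fastest SSD.

Regarding your problem: I would simply read the random rows one at a time; the inadequate chunk cache size is your real problem.

Compare the performance of the following code with your version: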

import time
import numpy as np
import h5py_cache as h5c  # third-party wrapper that sets the chunk cache size


def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    shape = (639038, 10000)
    chunk_shape = (100, 1000)
    Array = np.random.rand(shape[0]).astype(np.float32)

    # We are using about 4 GB of chunk cache memory here
    f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2 * 4000)
    d = f.create_dataset('Test', shape, dtype='f', chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Reading random rows
    # Reading one row actually loads 100 rows (whole chunks), so accessing a row
    # that is already in the cache gives a huge speed-up.
    f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 4000)
    d = f["Test"]
    for j in range(639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds = np.random.randint(0, high=shape[0], size=1000)  # high is exclusive
        for i in range(inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()


if __name__ == "__main__":
    ReadingAndWriting()


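The simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following: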

import time
import numpy as np
import h5py_cache as h5c  # third-party wrapper that sets the chunk cache size


def Writing():
    File_Name_HDF5 = 'Test.h5'

    shape = (63903, 10000)
    # Integer division: chunk dimensions must be integers
    chunk_shape = (10000, shape[1] // 50)
    Array = np.random.rand(shape[0]).astype(np.float32)

    # Writing_1: normal indexing (slice on the second axis)
    f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**3)
    d = f.create_dataset('Test', shape, dtype='f', chunks=chunk_shape)
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Writing_2: simplest form of fancy indexing (integer on the second axis)
    f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**3)
    d = f.create_dataset('Test', shape, dtype='f', chunks=chunk_shape)
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time() - t1)


if __name__ == "__main__":
    Writing()


On my SSD, the first version takes 10.8 seconds and the second version takes 55 seconds.

Asked:	8 years ago
Viewed:	3105 times
Last active:	8 years ago