Reading from a large file with h5py without loading the whole thing into memory

sup*_*ind 3 python hdf5 h5py

Does the following read from the dataset without loading the whole thing into memory at once (the whole thing will not fit in memory), and get the size of the dataset without loading the data, using h5py in Python? If not, how?

h5 = h5py.File('myfile.h5', 'r')
mydata = h5.get('matrix')  # are all data loaded into memory by h5.get?
part_of_mydata = mydata[1000:11000, :]
size_data = mydata.shape

Thanks.

hpa*_*ulj 5

get (or indexing) fetches a reference to the dataset on the file, but does not load any data.

In [789]: list(f.keys())
Out[789]: ['dset', 'dset1', 'vset']
In [790]: d=f['dset1']
In [791]: d
Out[791]: <HDF5 dataset "dset1": shape (2, 3, 10), type "<f8">
In [792]: d.shape         # shape of dataset
Out[792]: (2, 3, 10)
In [793]: arr=d[:,:,:5]    # indexing the set fetches part of the data
In [794]: arr.shape
Out[794]: (2, 3, 5)
In [795]: type(d)
Out[795]: h5py._hl.dataset.Dataset
In [796]: type(arr)
Out[796]: numpy.ndarray

The d dataset is array-like, but it is not actually a numpy array.
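Because indexing a Dataset reads only the requested slice, an oversized dataset can be processed block by block with bounded memory use. A minimal sketch (the file name, dataset name, and block size are placeholders; a small demo file stands in for the large one):

```python
import numpy as np
import h5py

# Create a small demo file standing in for the large one
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('matrix', data=np.arange(20000.0).reshape(2000, 10))

total = 0.0
with h5py.File('demo.h5', 'r') as f:
    d = f['matrix']                      # reference only; no data read yet
    nrows = d.shape[0]                   # shape comes from metadata
    for start in range(0, nrows, 500):   # read 500 rows at a time
        block = d[start:start + 500, :]  # only this block is in memory
        total += block.sum()
print(total)
```

Each iteration holds one 500-row numpy array; the full 2000x10 dataset is never materialized at once.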

To fetch the whole dataset:

In [798]: arr = d[:]
In [799]: type(arr)
Out[799]: numpy.ndarray

Exactly what it has to read from the file to fetch your slice depends on the slicing, the data layout, chunking, and other things that generally aren't under your control and shouldn't worry you.
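If you are curious about the layout, a dataset's `chunks` attribute shows whether it is stored in tiles or contiguously. A quick sketch (file and dataset names are invented):

```python
import numpy as np
import h5py

with h5py.File('chunked.h5', 'w') as f:
    # Explicitly chunked: stored and read in 100x10 tiles
    d = f.create_dataset('big', shape=(1000, 50), chunks=(100, 10), dtype='f8')
    big_chunks = d.chunks
    # Contiguous (no chunking requested): chunks is None
    c = f.create_dataset('small', data=np.ones(5))
    small_chunks = c.chunks
print(big_chunks, small_chunks)
```

Reading a slice of the chunked dataset pulls in whole chunks that the slice touches, which is why slice shape and chunk shape together determine how much I/O happens.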

Also note that reading one dataset does not load the other datasets. The same applies to groups.
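The same laziness holds for navigating groups: opening a file and listing keys touches only metadata, not array data. A small sketch (group and dataset names are made up):

```python
import numpy as np
import h5py

with h5py.File('grouped.h5', 'w') as f:
    g = f.create_group('grp')
    g.create_dataset('a', data=np.zeros(1000))
    f.create_dataset('b', data=np.ones(3))

with h5py.File('grouped.h5', 'r') as f:
    names = sorted(f.keys())       # metadata only; no array data read
    sub = sorted(f['grp'].keys())  # likewise for the group's contents
print(names, sub)
```

No values from 'a' or 'b' are read until one of those datasets is actually indexed.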

http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data