How do I combine multiple .h5 files?

ktt*_*_11 3 hdf5 pytables h5py hdf

Everything available online is too complicated. My database is large and I exported it in parts. I now have three .h5 files that I want to combine into a single .h5 file for further work. How can I do that?

kcw*_*w78 6

These examples show how to use h5py to copy datasets between 2 HDF5 files. See my other answer for PyTables examples. I created some simple HDF5 files to mimic CSV type data (all floats, but the process is the same if you have mixed data types). Based on your description, each file only has one dataset. When you have multiple datasets, you can extend this process with visititems() in h5py.

Note: code to create the HDF5 files used in the examples is at the end.

All methods use glob() to find the HDF5 files used in the operations below.
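On the visititems() extension mentioned above: when files contain multiple datasets, possibly nested in groups, visititems() walks the whole tree so you can copy every dataset while preserving its path. A minimal sketch (file names 'nested.h5' and 'merged.h5' are made up for this illustration):

```python
import h5py
import numpy as np

# Build a source file with datasets nested inside groups.
arr = np.random.random((10, 5))
with h5py.File('nested.h5', 'w') as h5f:
    h5f.create_dataset('g1/data_a', data=arr)
    h5f.create_dataset('g2/sub/data_b', data=arr)

# Walk the source tree; recreate each dataset at the same path in the new file.
with h5py.File('merged.h5', 'w') as h5fw, h5py.File('nested.h5', 'r') as h5fr:
    def copy_ds(name, node):
        # visititems() calls this for every group and dataset; copy datasets only.
        if isinstance(node, h5py.Dataset):
            h5fw.create_dataset(name, data=node[:])  # creates parent groups too
    h5fr.visititems(copy_ds)

with h5py.File('merged.h5', 'r') as h5f:
    print(sorted(h5f['g1'].keys()))
```

As with Method 2, dataset paths must be unique across source files, or later copies will collide with earlier ones.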

Method 1: Create External Links
This results in 3 Groups in the new HDF5 file, each with an external link to the original data. This does not copy the data, but provides access to the data in all files via the links in 1 file.

import glob
import h5py

with h5py.File('table_links.h5',mode='w') as h5fw:
    link_cnt = 0
    for h5name in glob.glob('file*.h5'):
        link_cnt += 1
        h5fw['link'+str(link_cnt)] = h5py.ExternalLink(h5name,'/')
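Once the link file exists, h5py resolves the external links transparently, so you can read the linked data as if it lived in `table_links.h5` (the source files must stay in place). This is a minimal sketch, not part of the original answer; it uses an explicit file list and the same names as the examples above:

```python
import h5py
import numpy as np

# Build three small source files (same layout as the creation code at the end).
fnames = ['file1.h5', 'file2.h5', 'file3.h5']
for fcnt, fname in enumerate(fnames, start=1):
    with h5py.File(fname, 'w') as h5f:
        h5f.create_dataset('data_' + str(fcnt), data=np.random.random((10, 5)))

# Method 1: one file of external links, one link per source file.
with h5py.File('table_links.h5', 'w') as h5fw:
    for link_cnt, h5name in enumerate(fnames, start=1):
        h5fw['link' + str(link_cnt)] = h5py.ExternalLink(h5name, '/')

# Reading through a link: 'link1/data_1' resolves into file1.h5.
with h5py.File('table_links.h5', 'r') as h5fr:
    print(h5fr['link1/data_1'].shape)
```

Note that deleting or moving a source file breaks its link; the link file stores only the target file name and path, not the data.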

Method 2a: Copy Data 'as-is'
(26-May-2020 update: This uses the .copy() method for all datasets.)
This copies the data from each dataset in the original file to the new file using the original dataset names. It loops to copy ALL root level datasets. This requires datasets in each file to have different names. The data is not merged into one dataset.

with h5py.File('table_copy.h5',mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name,'r') as h5fr:
            for obj in h5fr.keys():
                h5fr.copy(obj, h5fw)

Method 2b: Copy Data 'as-is'
(This was my original answer, before I knew about the .copy() method.)
This copies the data from each dataset in the original file to the new file using the original dataset name. This requires datasets in each file to have different names. The data is not merged into one dataset.

with h5py.File('table_copy.h5',mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name,'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
        h5fw.create_dataset(dset1,data=arr_data)

Method 3a: Merge all data into 1 Fixed size Dataset
This copies and merges the data from each dataset in the original file into a single dataset in the new file. In this example there are no restrictions on the dataset names. Also, I initially create a large dataset and don't resize. This assumes there are enough rows to hold all merged data. Tests should be added in production work.

with h5py.File('table_merge.h5',mode='w') as h5fw:
    row1 = 0
    h5fw.require_dataset('alldata', dtype="f", shape=(50,5), maxshape=(100,5))
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name,'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
        h5fw['alldata'][row1:row1+arr_data.shape[0],:] = arr_data[:]
        row1 += arr_data.shape[0]

Method 3b: Merge all data into 1 Resizeable Dataset
This is similar to the method above. However, I create a resizeable dataset and enlarge it based on the amount of data that is read and added.

with h5py.File('table_merge.h5',mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name,'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
        dslen = arr_data.shape[0]
        cols = arr_data.shape[1]
        if row1 == 0:
            h5fw.create_dataset('alldata', dtype="f", shape=(dslen,cols), maxshape=(None,cols))
        elif row1+dslen > len(h5fw['alldata']):
            h5fw['alldata'].resize( (row1+dslen, cols) )
        h5fw['alldata'][row1:row1+dslen,:] = arr_data[:]
        row1 += dslen
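As a quick sanity check (a sketch I am adding, not part of the original answer): with three 10x5 source files, the merged dataset from Method 3b should end up with all 30 rows. This version uses an explicit, sorted file list so the row order is deterministic:

```python
import h5py
import numpy as np

# Build three 10x5 source files, as in the creation code at the end.
fnames = ['file1.h5', 'file2.h5', 'file3.h5']
for fcnt, fname in enumerate(fnames, start=1):
    with h5py.File(fname, 'w') as h5f:
        h5f.create_dataset('data_' + str(fcnt), data=np.random.random((10, 5)))

# Method 3b, condensed: grow 'alldata' as each file's rows are appended.
with h5py.File('table_merge.h5', 'w') as h5fw:
    row1 = 0
    for h5name in fnames:
        with h5py.File(h5name, 'r') as h5fr:
            arr_data = h5fr[list(h5fr.keys())[0]][:]
        dslen, cols = arr_data.shape
        if row1 == 0:
            h5fw.create_dataset('alldata', dtype="f",
                                shape=(dslen, cols), maxshape=(None, cols))
        elif row1 + dslen > len(h5fw['alldata']):
            h5fw['alldata'].resize((row1 + dslen, cols))
        h5fw['alldata'][row1:row1 + dslen, :] = arr_data[:]
        row1 += dslen

# Verify the merge picked up every row.
with h5py.File('table_merge.h5', 'r') as h5f:
    print(h5f['alldata'].shape)
```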

To create the source files read above:

import h5py
import numpy as np

for fcnt in range(1,4,1):
    fname = 'file' + str(fcnt) + '.h5'
    arr = np.random.random(50).reshape(10,5)
    with h5py.File(fname,'w') as h5fw :
        h5fw.create_dataset('data_'+str(fcnt),data=arr)