Why does pickle + gzip outperform h5py on repetitive data?

ale*_*lex 2 python gzip numpy pickle h5py

I am saving a numpy array that contains repetitive data:

import numpy as np
import gzip
import cPickle as pkl
import h5py

# b stacks 10 overlapping, shifted views of a, so its columns are
# highly redundant -- ideal material for compression.
a = np.random.randn(100000, 10)
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

# gzipped pickle
f_pkl_gz = gzip.open('noise.pkl.gz', 'wb')
pkl.dump(b, f_pkl_gz, protocol=pkl.HIGHEST_PROTOCOL)
f_pkl_gz.close()

# plain pickle (binary mode, since HIGHEST_PROTOCOL is a binary format)
f_pkl = open('noise.pkl', 'wb')
pkl.dump(b, f_pkl, protocol=pkl.HIGHEST_PROTOCOL)
f_pkl.close()

# HDF5 with maximum gzip compression
f_hdf5 = h5py.File('noise.hdf5', 'w')
f_hdf5.create_dataset('b', data=b, compression='gzip', compression_opts=9)
f_hdf5.close()
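If you prefer to check the sizes from Python rather than the shell, a minimal sketch (file names as written above):

import os

# Compare the on-disk sizes of the three files written above.
for name in ('noise.hdf5', 'noise.pkl', 'noise.pkl.gz'):
    print('%s  %d bytes' % (name, os.path.getsize(name)))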

Now the resulting file sizes:

-rw-rw-r--. 1 alex alex 76962165 Oct  7 20:51 noise.hdf5
-rw-rw-r--. 1 alex alex 79992937 Oct  7 20:51 noise.pkl
-rw-rw-r--. 1 alex alex  8330136 Oct  7 20:51 noise.pkl.gz

So the hdf5 file with the highest compression setting takes roughly as much space as the uncompressed pickle, and almost 10 times as much as the gzipped pickle.

Does anyone know why this happens, and what I can do about it?

ale*_*lex 5

The answer is to use chunks, as @tcaswell suggested. I suppose compression is performed on each chunk separately, and the default chunk size is small, so there is not enough redundancy within a single chunk for the compression to benefit from it.
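A quick way to confirm this (a minimal sketch, assuming noise.hdf5 from the question is still on disk) is to ask h5py which chunk shape it picked automatically, since enabling compression in HDF5 implies chunking:

import h5py

# Inspect the chunk layout h5py chose automatically for the dataset
# written in the question (file assumed to exist).
with h5py.File('noise.hdf5', 'r') as f:
    print(f['b'].shape)   # e.g. (99991, 100)
    print(f['b'].chunks)  # the auto-chosen chunk shape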

Here is code showing the effect of the chunk size:

import numpy as np
import h5py

# Same redundant array as in the question.
a = np.random.randn(100000, 10)
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])

# Write the same data with progressively larger chunks along the first axis.
for n_rows in (1, 10, 100, 1000, 10000):
    with h5py.File('noise_chunk_%d.hdf5' % n_rows, 'w') as f:
        f.create_dataset('b', data=b, compression='gzip',
                         compression_opts=9, chunks=(n_rows, 100))

And the results:

-rw-rw-r--. 1 alex alex  8341134 Oct  7 21:53 noise_chunk_10000.hdf5
-rw-rw-r--. 1 alex alex  8416441 Oct  7 21:53 noise_chunk_1000.hdf5
-rw-rw-r--. 1 alex alex  9096936 Oct  7 21:53 noise_chunk_100.hdf5
-rw-rw-r--. 1 alex alex 16304949 Oct  7 21:53 noise_chunk_10.hdf5
-rw-rw-r--. 1 alex alex 85770613 Oct  7 21:53 noise_chunk_1.hdf5

So, as the chunks get smaller, the file size grows: with (1, 100) chunks there is almost no redundancy left inside a chunk for gzip to exploit.
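For completeness, a hedged sketch of reading the best-compressing file back to verify the data round-trips (assumes b from the script above is still in scope):

import numpy as np
import h5py

# Read the largest-chunk variant back and confirm it matches the original.
with h5py.File('noise_chunk_10000.hdf5', 'r') as f:
    b_read = f['b'][:]            # load the whole dataset into memory
print(np.array_equal(b, b_read))  # expected: True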