Deserializing a large numpy array with pickle is an order of magnitude slower than with numpy

Dav*_*rks 9 python numpy deserialization python-3.8

I'm deserializing large numpy arrays (500 MB in this example) and I'm finding order-of-magnitude differences between methods. Below are the 3 methods I timed.

I'm receiving the data from the multiprocessing.shared_memory package, so it arrives as a memoryview object. But for these simple examples I just pre-create a byte array to run the tests.
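For context, a minimal sketch of the shared_memory pattern mentioned above (the names src, view, reader are mine, not from the actual setup): wrapping the shared block with np.ndarray is zero-copy on both sides, which is what makes the raw-buffer path in the benchmark attractive.

```python
import numpy as np
from multiprocessing import shared_memory

# Writer side: create a shared block and copy an array into it once.
src = np.arange(16, dtype=np.uint8)
shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
view = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
view[:] = src

# Reader side: attach by name and wrap the buffer zero-copy -- no deserialization.
reader = shared_memory.SharedMemory(name=shm.name)
result = np.ndarray(src.shape, dtype=src.dtype, buffer=reader.buf)
ok = np.array_equal(result, src)

del view, result  # drop the exported buffers before closing the blocks
reader.close()
shm.close()
shm.unlink()
```

Note that the reader still has to know the shape and dtype out of band, which is the same inelegance the benchmark's raw-bytes option has.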

I'd like to know whether there is anything wrong with these methods, or whether there are other techniques I haven't tried. Deserialization is a real problem in Python if you want to move data quickly rather than hold the GIL just for I/O. A good explanation of why these methods vary so much would also make a good answer.

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

Results:

Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec

The second option is the fastest, but it is noticeably less elegant, because I need to serialize the shape and dtype information explicitly.
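One way to keep the raw-bytes speed while carrying the shape and dtype along is to prepend a tiny length-prefixed header. A sketch, assuming a simple homemade framing (pack/unpack are hypothetical helpers, not part of the question's code):

```python
import numpy as np
import pickle

def pack(arr):
    # 4-byte little-endian header length, then a pickled (shape, dtype) pair,
    # then the raw array payload.
    header = pickle.dumps((arr.shape, arr.dtype.str))
    return len(header).to_bytes(4, 'little') + header + arr.tobytes()

def unpack(buf):
    n = int.from_bytes(buf[:4], 'little')
    shape, dtype = pickle.loads(buf[4:4 + n])
    # frombuffer is zero-copy over the payload region of buf.
    return np.frombuffer(buf, dtype=dtype, offset=4 + n).reshape(shape)

a = np.random.randint(0, 255, size=(3, 4), dtype=np.uint8)
b = unpack(pack(a))
ok = np.array_equal(a, b) and a.dtype == b.dtype
```

The header cost is constant, so for a 500 MB payload this should time essentially the same as the bare np.ndarray-over-buffer option.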

Dou*_*s M 3

I found your question useful while looking for the best numpy serialization, and I confirmed that np.load() was the best until it was beaten by pyarrow in the additional test below. Arrow is now a hugely popular data-serialization framework for distributed computing (e.g. Spark, ...). (As an aside: pa.serialize / pa.deserialize were deprecated in later pyarrow releases, so this snippet is tied to older versions such as the 1.x used here.)

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()

serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))

Results on an i3.2xlarge, Databricks Runtime 8.3 ML, Python 3.8, Numpy 1.19.2, Pyarrow 1.0.1:

Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec

Your BytesIO result is about 100x mine, and I don't know why.
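One possible factor (an assumption, not verified on the asker's setup): np.load on a file-like object goes through buffered read() calls plus an extra copy, and some numpy/Python combinations hit pathological small-chunk reads on BytesIO. When the .npy bytes are already in memory, you can parse just the header with numpy's np.lib.format helpers and wrap the payload zero-copy with np.frombuffer; a sketch:

```python
import io
import numpy as np

sample = np.random.randint(0, 255, size=1024, dtype=np.uint8)
bio = io.BytesIO()
np.save(bio, sample, allow_pickle=False)

# Parse only the .npy header to find where the payload starts.
bio.seek(0)
np.lib.format.read_magic(bio)  # consume the magic string and version
shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(bio)
offset = bio.tell()

# Wrap the payload in place instead of streaming it out through read().
result = np.frombuffer(bio.getbuffer(), dtype=dtype, offset=offset).reshape(shape)
ok = np.array_equal(result, sample)
```

read_array_header_1_0 assumes the common 1.0 header version that np.save emits for small headers; a robust version would dispatch on the tuple returned by read_magic.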