Querying a NumPy array of NumPy arrays saved as an npz file is slow

Fra*_*urt 2

Tags: python, arrays, performance, numpy

I generate an npz file as follows:

import numpy as np
import os

# Generate npz file
dataset_text_filepath = 'test_np_load.npz'
texts = []
for text_number in range(30000):
    # np.random.random_integers is deprecated; randint's upper bound is exclusive
    texts.append(np.random.randint(0, 20001,
                 size=np.random.randint(0, 101)))
texts = np.array(texts, dtype=object)  # rows have different lengths, so use an object array
np.savez(dataset_text_filepath, texts=texts)

This gives me a ~7 MiB npz file (containing essentially a single variable, texts, which is a NumPy array of NumPy arrays):


I load it with numpy.load():

# Load data (allow_pickle=True is required for object arrays on NumPy >= 1.16.3)
dataset = np.load(dataset_text_filepath, allow_pickle=True)

If I query it as follows, it takes several minutes:

# Querying data: the slow way
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(dataset['texts']), size=10)
    dataset['texts'][random_indices]

Whereas if I query it as follows, it takes less than 5 seconds:

# Querying data: the fast way
data_texts = dataset['texts']
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(data_texts), size=10)
    data_texts[random_indices]

Why is the second approach so much faster than the first?

hpa*_*ulj 5

dataset['texts'] reads the file every time it is used. load of an npz only returns a file loader, not the actual data. It is a "lazy loader", loading a given array only when it is accessed. The load docs could be clearer, but they say:

- If the file is a ``.npz`` file, the returned value supports the context
  manager protocol in a similar fashion to the open function::

    with load('foo.npz') as data:
        a = data['a']

  The underlying file descriptor is closed when exiting the 'with' block.
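In practice that means pulling the array you need into a regular variable while the file is open; the in-memory copy stays valid after the handle is closed. A minimal sketch, reusing the foo.npz name from the quoted docs (the file is created here just so the example is self-contained):

```python
import numpy as np

# Create a small .npz file matching the docs' example name.
np.savez('foo.npz', a=np.arange(5))

# Copy the array out inside the with block; the file descriptor is
# closed on exit, but the in-memory array remains usable.
with np.load('foo.npz') as data:
    a = data['a']

print(a)  # still accessible after the file is closed
```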

And from savez:

When opening the saved ``.npz`` file with `load` a `NpzFile` object is
returned. This is a dictionary-like object which can be queried for
its list of arrays (with the ``.files`` attribute), and for the arrays
themselves.

More details are in help(np.lib.npyio.NpzFile).
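The lazy-loading behavior is easy to confirm with a quick timing sketch (a scaled-down version of the question's setup; the file name and sizes here are arbitrary):

```python
import time
import numpy as np

# Build a small .npz of ragged integer arrays, mirroring the question.
rng = np.random.default_rng(0)
texts = np.array(
    [rng.integers(0, 20000, size=rng.integers(1, 101)) for _ in range(1000)],
    dtype=object,
)
np.savez('test_np_load_small.npz', texts=texts)

dataset = np.load('test_np_load_small.npz', allow_pickle=True)

# Slow: every dataset['texts'] access decompresses and unpickles
# the whole array from disk again.
start = time.perf_counter()
for _ in range(20):
    _ = dataset['texts']
slow = time.perf_counter() - start

# Fast: read the array from the file once, then index it in memory.
cached = dataset['texts']
start = time.perf_counter()
for _ in range(20):
    _ = cached[:10]
fast = time.perf_counter() - start

print(f'repeated file reads: {slow:.4f}s, cached array: {fast:.4f}s')
```

On any machine the cached version should be faster by orders of magnitude, since the loop body is reduced to an in-memory slice.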