从给定索引数组的Python列表中提取子列表的最快方法

Question

从给定索引数组的Python列表中提取子列表的最快方法

我有l任何类型的对象的大型Python列表，还有另一个指向 list 中某些元素的整数索引的大型列表i（甚至NumPy数组）l。

问题是创建另一个列表的最快（最有效）的方法是什么，l2该列表包含的元素l以及来自的索引i。

最简单的方法是进行列表理解：

l2 = [l[si] for si in i]
# Use np.nditer(i) instead of i, for NumPy array case

Run Code Online (Sandbox Code Playgroud)

但这是最快的方法吗？

列表理解是一个 Python 循环，因此对于大型列表可能会很慢，也许标准库中有一些内置的 Python 方法可以高效编写C来实现此任务？或者也许NumPy有这样的方法通过 numpy 数组索引 Python 列表？

也许标准 python 库中有一些简单快速的函数可以对 NumPy 的np.take执行相同的操作，如下面的假想代码所示：

import listtools
l2 = listtools.take(l, indexes)

Run Code Online (Sandbox Code Playgroud)

Answer 1

Pau*_*zer 5

operator.itemgetter通过使用支持批量查找，您可以获得较小的加速（在下面的示例中约为 25％）：

>>> import string
>>> import random
>>> import operator as op
>>> from timeit import timeit

# create random lists
>>> l = [random.choice([*string.ascii_letters,*range(100)]) for _ in range(1000000)]
>>> i = [random.randint(0,999999) for _ in range(300000)]

# timings
>>> timeit(lambda:[l[si] for si in i],number=100)
3.0997245000035036
>>> timeit(lambda:list(map(l.__getitem__,i)),number=100)
2.892384369013598
>>> timeit(lambda:list(op.itemgetter(*i)(l)),number=100)
2.1787672539940104

Run Code Online (Sandbox Code Playgroud)

Answer 2

Art*_*oul 5

众所周知，NumPy数组还可以通过dtype = np.object_.

所以我决定测量 NumPy 与普通 Python 的使用速度。另外，正如我在问题中提到的，我还想解决索引是 numpy 整数数组的情况。

接下来的代码测量不同的情况，我们是否需要将源列表转换为 numpy 数组以及结果是否也应该转换。

在线尝试一下！

import string
from timeit import timeit
import numpy as np
np.random.seed(0)

letters = np.array(list(string.ascii_letters), dtype = np.object_)
nl = letters[np.random.randint(0, len(letters), size = (10 ** 6,))]
l = nl.tolist()
ni = np.random.permutation(np.arange(nl.size, dtype = np.int64))
i = ni.tolist()

pyt = timeit(lambda: [l[si] for si in i], number = 10)
print('python:', round(pyt, 3), flush = True)

for l_from_list in [True, False]:
    for i_from_list in [True, False]:
        for l_to_list in [True, False]:
            def Do():
                cl = np.array(l, dtype = np.object_) if l_from_list else nl
                ci = np.array(i, dtype = np.int64) if i_from_list else ni
                res = cl[ci]
                res = res.tolist() if l_to_list else res
                return res
            ct = timeit(lambda: Do(), number = 10)
            print(
                'numpy:', 'l_from_list', l_from_list, 'i_from_list', i_from_list, 'l_to_list', l_to_list,
                'time', round(ct, 3), 'speedup', round(pyt / ct, 2), flush = True
            )

Run Code Online (Sandbox Code Playgroud)

输出：

python: 2.279
numpy: l_from_list True  i_from_list True  l_to_list True  time 2.924 speedup 0.78
numpy: l_from_list True  i_from_list True  l_to_list False time 2.805 speedup 0.81
numpy: l_from_list True  i_from_list False l_to_list True  time 1.457 speedup 1.56
numpy: l_from_list True  i_from_list False l_to_list False time 1.312 speedup 1.74
numpy: l_from_list False i_from_list True  l_to_list True  time 2.352 speedup 0.97
numpy: l_from_list False i_from_list True  l_to_list False time 2.209 speedup 1.03
numpy: l_from_list False i_from_list False l_to_list True  time 0.894 speedup 2.55
numpy: l_from_list False i_from_list False l_to_list False time 0.75  speedup 3.04

Run Code Online (Sandbox Code Playgroud)

所以我们可以看到，如果我们将所有列表存储为 numpy 数组，那么我们就会获得3x加速！但如果只有索引是一个 numpy 数组，那么我们就能得到加速，1.56x这也非常好。如果所有内容都必须从列表来回转换，那么我们会获得加速0.78x，这意味着我们会放慢速度，因此，如果我们仅使用列表，那么通过 numpy 进行索引是没有帮助的。

归档时间：	5 年，9 月前
查看次数：	1028 次
最近记录：	5 年，9 月前