for/in/if list comprehension becomes very slow with a large number of matches

Question


Mic*_*elA 1 python performance list-comprehension list

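My Python 2.7 code contains the following list comprehension, which returns the line number (index) and the line itself from a long list of lines: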

results = [[lines.index(line), line] for line in lines
            if search_item in line.lower()]


When the number of results is small, this is lightning fast:

The search item is: [ 1330 ]
Before string pre-processing, the time is: 0.0000
The number of lines is: 1,028,952
After string pre-processing, the time is: 0.2500
The number of results is: 249


"String pre-processing" is what I call the results = ... operation above.

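Here is the same operation, but with "1330" as the search item instead of " 1330 " (that is, without the surrounding spaces). This one produces 6,049 matches instead of 249: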

The search item is: [1330]
Before string pre-processing, the time is: 0.0000
The number of lines is: 1,028,952
After string pre-processing, the time is: 10.3180
The number of results is: 6,049


As you can see, that is 10 seconds versus 1/4 second. Moreover, the " 1330 " and "1330" searches run in 2.4 and 3.2 seconds respectively when using a for loop:

for lineNum, line in enumerate(lines):
    if search_item in line.lower():
        return lineNum, line


So the list comprehension improves performance 10x in the 249-result case, but is more than 3x slower for 6,049 results.

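Clearly, the problem is not in the if/in part of the list comprehension (both searches scan all 1M+ lines and accept or reject each one), but in building a result list that is "long" in the second case. In other words, the bottleneck seems to be in the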

results = [lines.index(line), line]


part of the comprehension.

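I guess I'm surprised that the list comprehension becomes so slow for large result sets (and 6K really isn't that large). What am I missing? Is there a different approach I should use that would consistently outperform the for loop?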

Answer 1

Mar*_*ers 5

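The list.index() call has to search through all the lines to find the match. For N lines, you perform O(N^2) steps; 1000 lines becomes a million steps, and so on. For 6k lines, that's 36 million steps^*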

If all you need is a line number, use the enumerate() function to generate one:

results = [[index, line] for index, line in enumerate(lines)
            if search_item in line.lower()]


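enumerate() adds a running counter as it goes, so your algorithm performs only O(N) steps. You already use it in your full for loop statement, but not in your list comprehension.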
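As a rough illustration of the two approaches side by side, here is a minimal sketch. The sample data and the function names with_index and with_enumerate are made up for this example and are not from the question; the code is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical sample data; the question's `lines` holds about a million lines.
lines = ["line %d with id %d" % (i, i * 7) for i in range(2000)]
search_item = "133"


def with_index(lines, search_item):
    # list.index() rescans the list from the start for every matching line.
    return [[lines.index(line), line] for line in lines
            if search_item in line.lower()]


def with_enumerate(lines, search_item):
    # enumerate() supplies the position as the loop advances; nothing is rescanned.
    return [[index, line] for index, line in enumerate(lines)
            if search_item in line.lower()]


slow = with_index(lines, search_item)
fast = with_enumerate(lines, search_item)

# With unique lines, both versions produce the same result list.
print(slow == fast)
```

Wrapping each call in timeit with a larger list should make the gap visible, though the exact numbers will depend on the machine and the data.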

The output will differ, however, if you have duplicate lines; lines.index() finds the first match, while enumerate() produces unique line numbers.
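A small example with made-up data shows the difference when a line occurs more than once:

```python
# Hypothetical data containing a duplicate line.
lines = ["alpha", "beta", "alpha", "gamma"]
search_item = "alpha"

# list.index() always reports the position of the first equal line.
by_index = [[lines.index(line), line] for line in lines
            if search_item in line.lower()]

# enumerate() reports the position actually being visited.
by_enumerate = [[index, line] for index, line in enumerate(lines)
                if search_item in line.lower()]

print(by_index)      # [[0, 'alpha'], [0, 'alpha']]
print(by_enumerate)  # [[0, 'alpha'], [2, 'alpha']]
```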

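^* Big-O notation gives us the asymptotic behaviour of an algorithm. Because list.index() for a given line x only has to scan (at most) x lines to find the index, if you do this for every line you iterate over, you only take 1 + 2 + 3 + ... + x steps in total, which is a triangular number. So in total "only" ((N * (N + 1)) / 2) steps are taken, which is roughly 1/2 N^2 steps. But as N tends to infinity the multiplier no longer matters, and you end up with O(N^2).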
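The triangular-number count can be checked with a few lines of Python. This is only a sketch: index_steps is a hypothetical helper, and it assumes every line is unique and that list.index() is called once for each of them.

```python
def index_steps(n):
    """Count comparisons when list.index() is called once per item
    of an n-item list whose values are all unique."""
    # The item at 0-based position p is found after p + 1 comparisons.
    return sum(p + 1 for p in range(n))


for n in (10, 1000, 6000):
    # Matches the closed form N * (N + 1) / 2.
    assert index_steps(n) == n * (n + 1) // 2

print(index_steps(6000))  # 18003000, about half of 6000 ** 2
```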
