如何改进 Python 中列表的模式匹配

Question

如何改进 Python 中列表的模式匹配

use*_*419 9 python list python-itertools

我有一个很大的列表，其中可能包含数千到数百万个条目。我设置了一个有限大小的窗口来滑过列表。我需要计算窗口中匹配的元素并通过一次向前滑动窗口 1 的位置来重复该过程。这是一个列表的简单示例

L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]

Run Code Online (Sandbox Code Playgroud)

假设窗口的长度为 3，它将捕获

[1 2 1] 包含一对匹配元素 (1 & 1)
将窗口向前移动 1 个位置 => [2 1 3]，没有匹配的元素
将窗口向前移动 1 个位置 => [1 3 4]，没有匹配的元素
将窗口向前移动 1 个位置 => [3 4 5]，没有匹配的元素
将窗口向前移动 1 个位置 => [4 5 1]，没有匹配的元素
将窗口向前移动 1 个位置 => [5 1 2]，没有匹配的元素
将窗口向前移动 1 个位置 => [1 2 1], 1 个匹配元素 (1 & 1)
将窗口向前移动 1 个位置 => [2 1 2], 1 个匹配元素 (2 & 2)
将窗口向前移动 1 个位置 => [1 2 2], 1 个匹配元素 (2 & 2)
将窗口向前移动 1 个位置 => [2 2 2]，3 个匹配元素 ([2 2 -], [2 - 2], [- 2 2])
将窗口向前移动 1 个位置 => [2 2 3], 1 个匹配元素 (2 & 2)

所以总共有 1 + 1 + 1 + 1 + 3 + 1 = 8 个匹配对。我找到了使用 itertools 查找窗口中所有元素的组合并开发代码来查找所有匹配对的想法

import itertools
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = 0
for n in range(len(L)-winlen+1):
    window = [L[n+i] for i in range(winlen)]
    A = list(itertools.combinations(window, 2))
    match = [a==b for a, b in A]
    totalMatch += sum(match)

Run Code Online (Sandbox Code Playgroud)

它适用于一个简短的列表，但对于列表和窗口变大，这段代码太慢了。我已经使用 C++ 多年，决定改用 python，如果有任何提高代码效率的提示，我将不胜感激。

Answer 1

Man*_*uel 9

更有效地跟踪窗口中的数据？这是 O(|L|) 而不是你的 O(|L|*winlen^2)。它保持窗口的元素计数ctr和窗口的匹配项match。例如，当一个新值进入窗口并且窗口中已经有该值的两个实例时，您将获得两个新匹配项。类似地，对于掉出窗口的值，它需要与窗口中的其他实例一样多的匹配项。

from collections import Counter

L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3

totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
    
    # Remove old element falling out of window
    if i >= winlen:
        ctr[L[i-winlen]] -= 1
        match -= ctr[L[i-winlen]]

    # Add new element to window
    match += ctr[x]
    ctr[x] += 1

    # Update the total (for complete windows)
    if i >= winlen - 1:
        totalMatch += match

print(totalMatch)

Run Code Online (Sandbox Code Playgroud)

基准测试结果与L和winlen乘以20：

 38.75 ms  original
  0.18 ms  Manuel

 38.73 ms  original
  0.19 ms  Manuel

 38.87 ms  original
  0.18 ms  Manuel

Run Code Online (Sandbox Code Playgroud)

基准代码（还包括长度为 0 到 9 的所有数字 1 到 3 列表的测试代码）：

from timeit import repeat
import itertools
from itertools import product
from collections import Counter

def original(L, winlen):
    totalMatch = 0
    for n in range(len(L)-winlen+1):
        window = [L[n+i] for i in range(winlen)]
        A = list(itertools.combinations(window, 2))
        match = [a==b for a, b in A]
        totalMatch += sum(match)
    return totalMatch

def Manuel(L, winlen):
    totalMatch = match = 0
    ctr = Counter()
    for i, x in enumerate(L):
        if i >= winlen:
            ctr[L[i-winlen]] -= 1
            match -= ctr[L[i-winlen]]
        match += ctr[x]
        ctr[x] += 1
        if i >= winlen - 1:
            totalMatch += match
    return totalMatch

def test():
    for n in range(10):
        print('testing', n)
        for T in product([1, 2, 3], repeat=n):
            L = list(T)
            winlen = 3
            expect = original(L, winlen)
            result = Manuel(L, winlen)
            assert result == expect, (L, expect, result)

def bench():
    L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20
    winlen = 3 * 20
    for _ in range(3):
        for func in original, Manuel:
            t = min(repeat(lambda: func(L, winlen), number=1))
            print('%6.2f ms ' % (t * 1e3), func.__name__)
        print()

test()
bench()

Run Code Online (Sandbox Code Playgroud)

归档时间：	4 年，8 月前
查看次数：	266 次
最近记录：	4 年，8 月前