如何高效创建子字符串

Mar*_*l B 5 python performance

给定一个字符串(通常是一个句子),我想提取 lengths 的所有子字符串3, 4, 5, 6。如何仅使用 Python 的标准库有效地实现这一目标?这是我的方法,我正在寻找一种更快的方法。对我来说,这三个外循环似乎都是不可避免的,但也许有一个低级优化的解决方案itertools

import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i+size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")

start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()

print(f"{end-start} seconds for naive approach")
Run Code Online (Sandbox Code Playgroud)

输出naive()

['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']
Run Code Online (Sandbox Code Playgroud)

第二个版本:

def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size+1):
                    grams.append(word[i:i+size])
    return grams
Run Code Online (Sandbox Code Playgroud)

Jér*_*ard 4

好吧,我认为这是不可能改进算法的,但是你可以对函数进行微优化:

\n
def naive3(test_sentence,start,end):\n    rng = range(start,end)\n    return [word[i:i+size] for word in test_sentence\n                           if len(word) >= start\n                           for size in rng\n                           for i in range(len(word)+1-size)]\n
Run Code Online (Sandbox Code Playgroud)\n

Python 3.8 引入了对性能非常有用的赋值表达式。因此,如果您可以使用最新版本,那么您可以编写:

\n
def naive4(test_sentence,start,end):\n    rng = range(start,end)\n    return [word[i:i+size] for word in test_sentence \n                           if (lenWord := len(word)+1) > start\n                           for size in rng\n                           for i in range(lenWord-size)]\n
Run Code Online (Sandbox Code Playgroud)\n

以下是性能结果:

\n
naive2: 8.28 \xc2\xb5s \xc2\xb1  55 ns per call\nnaive3: 7.28 \xc2\xb5s \xc2\xb1 124 ns per call\nnaive4: 6.86 \xc2\xb5s \xc2\xb1  48 ns per call    (20% faster than naive2)\n
Run Code Online (Sandbox Code Playgroud)\n

请注意,一半的时间naive4花费在创建word[i:i+size]字符串对象上,其余时间主要花费在 CPython 解释器中(主要是由于可变大小整数对象的创建/引用计数/删除)。

\n