Python string occurrence count: regex performance

Question

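I've been asked to find the total number of occurrences of a substring in a given string (case-insensitive, with or without punctuation). Some examples: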

count_occurrences("Text with", "This is an example text with more than +100 lines") # Should return 1
count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") # Should return 1
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") # Should return 1
count_occurrences("clock", "its 3o'clock in the morning") # Should return 0

I went with regex rather than .count() because I need exact matches, and ended up with:

import re

def count_occurrences(word, text):
    # Lookarounds reject a match that touches a letter or a lone apostrophe on either side
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

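I get the correct count for every case, but my code takes 0.10 s while the expected time is 0.025 s. Am I missing something? Is there a better (performance-optimized) way to do this?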

Answer 1

Tom*_*tah 3

OK, I was struggling to make it work without regex, since regex tends to be slow for this kind of check. Here's what I came up with:

def count_occurrences(word, text):
    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
    s = text.lower().split(word.lower())
    l = len(s)
    return sum((
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))

The whole file, run on ideone:

Ran 1 test in 0.025s

OK

Which is the time the question asked for.
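To compare variants locally rather than relying on a single test run, one option is the standard timeit module. This is only a sketch: time_call and naive_count are hypothetical names introduced here, and naive_count is a stand-in to be replaced by whichever count_occurrences is being measured.

```python
import timeit

def time_call(func, *args, number=1000):
    """Return the average seconds per call of func(*args) over `number` runs."""
    return timeit.timeit(lambda: func(*args), number=number) / number

# Hypothetical stand-in; swap in the count_occurrences under test.
def naive_count(word, text):
    return text.lower().count(word.lower())

# Assumed sample input: one of the question's strings repeated 100 times.
sample = "This is an example text with more than +100 lines " * 100
print(f"{time_call(naive_count, 'text with', sample):.6f} s per call")
```

Absolute numbers depend on the machine, so only the ratio between two variants timed the same way is meaningful.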

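The logic is pretty simple. Let's split text by word, both lowercased. Now let's look at each pair of neighbours. If, for instance, index 0 ends with a valid delimiter and index 1 starts with a valid delimiter, count it as an occurrence. Let's do that all the way to the last pair of the split.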
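To make the split-and-compare idea concrete, here is a minimal sketch using one of the question's sample strings. The names parts and pairs are illustrative only and do not appear in the answer's code.

```python
text = "This is an example 'text' with (more than) +100 lines"
word = "more than"

# Splitting on the lowercased word leaves the text around each occurrence.
parts = text.lower().split(word.lower())

# Each adjacent pair (left, right) surrounds exactly one occurrence of word.
pairs = list(zip(parts, parts[1:]))

for left, right in pairs:
    # Only the end of `left` and the start of `right` decide whether it counts.
    print(repr(left[-1:]), repr(right[:1]))
```

Here the single pair ends with "(" on the left and starts with ")" on the right, both of which are in the delimiter lists, so the occurrence is counted.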

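Since performance matters here, we have to be mindful of the order of spaces and endings. We are basically looking for the first item in the list that satisfies the condition, so it is important to put the more common delimiters first. For instance, if I declare: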

spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

Instead of what I have in my solution, I get a run of 0.036 seconds.
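The effect of the ordering can be seen by counting how many delimiters are tested before a match is found. This is a small sketch, not a benchmark: checks_until_match is a hypothetical helper written for illustration that mimics how any() stops at the first truthy result.

```python
def checks_until_match(s, delimiters):
    """Return how many delimiters are tested before any() would stop."""
    count = 0
    for d in delimiters:
        count += 1
        if s.endswith(d):
            break  # any() short-circuits at the first truthy result
    return count

common_first = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
common_last = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

left = "this is an example "  # text preceding an occurrence, ends with a space
print(checks_until_match(left, common_first))  # 1
print(checks_until_match(left, common_last))   # 9
```

When most occurrences are surrounded by plain spaces, putting ' ' first means nearly every check finishes after one comparison.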

If, for instance, I declare a single array:

spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

that holds all the delimiters and use only it, I get 0.053 seconds, which is roughly twice the 0.025 seconds of my solution.

There may be an even better solution that declares the delimiters in another order.
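One variation that might be worth trying, though it is not what was measured above: str.endswith and str.startswith accept a tuple of candidates, so the per-delimiter loop moves into a single built-in call. The sketch below is intended to mirror the same boundary checks; count_occurrences_tuple is a hypothetical name, and no timing is claimed for it.

```python
def count_occurrences_tuple(word, text):
    # Same delimiter sets as above, stored as tuples so they can be
    # passed directly to str.endswith / str.startswith.
    spaces = (' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__")
    endings = spaces + ('?', '.', '!', ',', ')', '"', '»')

    parts = text.lower().split(word.lower())
    last = len(parts) - 2  # index of the final (left, right) pair
    total = 0
    for i in range(len(parts) - 1):
        left, right = parts[i], parts[i + 1]
        # An empty left side is accepted only at the very start of the text.
        left_ok = left.endswith(spaces) or (i == 0 and left == '')
        # An empty right side is accepted only for the first or last pair.
        right_ok = right.startswith(endings) or (right == '' and i in (0, last))
        total += left_ok and right_ok
    return total

print(count_occurrences_tuple("more than", "This is an example 'text' with (more than) +100 lines"))
```

Whether this is faster would need to be measured on the actual test input.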
