How to efficiently search for a similar substring in a large text in Python?

Question

Let me try to explain my problem with an example. I have a large corpus and a substring, as shown below:

corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

The substring and the corpus are very similar, but not identical.

If I do something like this,

import re
re.search(substring, corpus, flags=re.I)  # this fails: the substring is not an exact match, only very similar

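the search fails, because in the corpus the passage looks like the text below, which differs slightly from my substring. Can someone suggest a good alternative for this kind of similar-substring lookup?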

until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now

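I did try the difflib library, but it did not satisfy my use case.

Some background information:

The substring I have now was obtained some time ago from the corpus by preprocessing it with this regex: re.sub("[^a-zA-Z]", " ", corpus).

Now I need to take that substring, do a reverse lookup in the original corpus text, and find its start and end indices in the corpus.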

Answer 1

Sha*_*ger 3

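You don't actually need much fuzzy matching, at least for the example given. The text can only differ at the spaces within substring, and only by having at least one non-alphabetic character there (a space may be replaced by other characters, but it cannot be deleted without a replacement). This means you can construct a regex directly from substring, with a wildcard between words, and search (or finditer) the corpus with it. The resulting match object tells you where the match begins and ends: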

import re

# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z]+'.join(substring.split()), re.IGNORECASE)

if m := ssre.search(corpus):
    print(m.start(), m.end())
    print(repr(m.group(0)))

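This correctly identifies where the match begins (index 217) and ends (index 771) in corpus. If you prefer, .group(0) extracts the matching text for you directly (it is uncommon to need the indices themselves, so there is a good chance you only wanted them in order to extract the real text, and .group(0) does that in one step). The output is:

217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"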

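If spaces might be deleted without being replaced, just change the + quantifier to * (the regex will run a little slower, since it cannot short-circuit as easily, but it will still work and should run fast enough).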
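As a minimal sketch of the difference between the two quantifiers, the snippet below builds both patterns for a toy corpus in which the separator between two words has vanished entirely. The helper name `build_pattern` and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

def build_pattern(substring, joiner):
    # Hypothetical helper: join the whitespace-separated words of
    # `substring` with `joiner`. The words are assumed to be purely
    # alphabetic, so they are not escaped here.
    return re.compile(joiner.join(substring.split()), re.IGNORECASE)

corpus = "He filled a notebook."
substring = "a note book"

strict = build_pattern(substring, r"[^a-z]+")  # at least one separator required
loose = build_pattern(substring, r"[^a-z]*")   # separator may be missing

strict_match = strict.search(corpus)  # no match: "note" and "book" are fused
loose_match = loose.search(corpus)    # matches "a notebook"

print(strict_match)
print(loose_match.span(), repr(loose_match.group(0)))
```

One thing to keep in mind with `*`: since no separator is required, the pattern can also match word sequences that are fused inside a longer word, so it is worth checking the matched text before trusting the indices.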

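If you need to handle non-ASCII alphabetic characters, the regex joiner can be changed from r'[^a-z]+' to the equivalent r'[\W\d_]+' (which means "match all non-word characters [non-alphanumeric and non-underscore], plus digit characters and underscores"). It is a little more awkward to read, but it handles characters like é correctly (treating them as part of a word, not as a connector).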
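To illustrate why the joiner matters, here is a small sketch with an invented French snippet (not from the original thread). With the ASCII joiner, the accented word "à" is treated as a separator and silently skipped, while the Unicode-aware joiner treats it as a word and refuses to match across it. This relies on `\W` being Unicode-aware for `str` patterns in Python 3.

```python
import re

# "\u00e0" is the single precomposed character "à".
corpus = "vu \u00e0 Paris, vu... Paris"
substring = "vu Paris"

words = substring.split()
ascii_re = re.compile(r"[^a-z]+".join(words), re.IGNORECASE)
unicode_re = re.compile(r"[\W\d_]+".join(words), re.IGNORECASE)

ascii_m = ascii_re.search(corpus)      # wrongly spans "vu à Paris"
unicode_m = unicode_re.search(corpus)  # finds the later "vu... Paris"

print(ascii_m.span(), repr(ascii_m.group(0)))
print(unicode_m.span(), repr(unicode_m.group(0)))
```

If the substring was produced by a Unicode-aware preprocessing step, the second joiner is the one that mirrors it.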

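While this is not as flexible as difflib, when you know that no words were removed or added and the only differences are spacing and punctuation, it works well, and it should run much faster than a true fuzzy-matching solution (which has to do far more work to handle the notion of a near match).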
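The answer mentions finditer as an alternative to search. A sketch of that variant, collecting the start and end index of every occurrence, might look like the following. The function name `find_spans` and the sample text are hypothetical; `re.escape` is added as a defensive assumption in case a word ever contains a regex metacharacter, which cannot happen for substrings produced by the `[^a-zA-Z]` preprocessing described in the question.

```python
import re

def find_spans(substring, corpus):
    # Hypothetical helper: return (start, end) for every place where the
    # words of `substring` appear in `corpus`, separated by one or more
    # non-alphabetic characters.
    words = (re.escape(w) for w in substring.split())
    pattern = re.compile(r"[^a-z]+".join(words), re.IGNORECASE)
    return [m.span() for m in pattern.finditer(corpus)]

corpus = "Brand-new car! I said: brand new car."
spans = find_spans("brand new car", corpus)

for start, end in spans:
    print(start, end, repr(corpus[start:end]))
```

Slicing the corpus with each returned pair gives back the original text, punctuation included, which is the reverse lookup the question asks for.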
