Let me try to explain my problem with an example. I have a large corpus and a substring, as shown below:
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
The substring and the corpus text are very similar, but not identical.
If I do something like this:
import re
re.search(substring, corpus, flags=re.I)  # this fails: the substring is not an exact match, only a very similar one
In the corpus, the substring appears as shown below. It differs slightly from the substring I have, so the regex search fails. Can someone suggest a good alternative for finding a similar substring?
until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now
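I did try the difflib library, but it did not satisfy my use case.
Some background information:
The substring I have now was obtained some time ago from a preprocessed corpus using this regex: re.sub("[^a-zA-Z]", " ", corpus).
Now I need to use that substring to do a reverse lookup in the corpus text and find its start and end indices in the corpus.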
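You don't actually need much fuzzy matching, at least for the example given: the text can only change at the whitespace within substring, and only by adding at least one non-alphabetic character (which may replace the whitespace, but whitespace cannot be removed without a replacement). This means you can build a regex directly from substring, with a wildcard between the words, then search (or finditer) the corpus with it, and the resulting match object will tell you where the match starts and ends: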
import re

# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z]+'.join(substring.split()), re.IGNORECASE)

if m := ssre.search(corpus):
    print(m.start(), m.end())
    print(repr(m.group(0)))
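This correctly identifies where the match starts (index 217) and ends (index 771) in corpus; .group(0) can extract the matching text directly if you prefer (needing the indices themselves is uncommon, so there is a good chance you only wanted them in order to extract the real text, and .group(0) does that directly). The output is: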
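217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"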
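Since finditer is mentioned as an alternative to search, here is a minimal sketch of wrapping the same idea in a helper that returns every matching span. The name find_spans and the short sample strings are made up for illustration; re.escape is only a precaution, because words produced by the preprocessing regex contain letters only.

```python
import re

def find_spans(substring, corpus):
    """Return (start, end) pairs for each place the words of `substring`
    occur in `corpus`, separated only by non-letter characters."""
    words = substring.split()
    if not words:
        return []
    # Same joiner as above: one or more characters that are not ASCII letters
    pattern = re.compile(r'[^a-z]+'.join(map(re.escape, words)), re.IGNORECASE)
    return [m.span() for m in pattern.finditer(corpus)]

# Hypothetical sample data, much shorter than the corpus in the question
corpus = "fully booked until then), then i dropped off my car. Later: until then, then i left."
substring = "until then then i"

spans = find_spans(substring, corpus)
for start, end in spans:
    print(start, end, repr(corpus[start:end]))
```

Slicing corpus[start:end] gives the same text as .group(0), so the helper can return only the indices.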
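If whitespace might be removed without being replaced, just change the + quantifier to * (the regex will run a little slower, since it cannot short-circuit as easily, but it will still work and should run fast enough).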
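To illustrate the difference between the two quantifiers, here is a small sketch using a made-up case in which the corpus writes "pickup" as one word while the substring has "pick up". The sample strings and variable names are assumptions for the example, not taken from the question.

```python
import re

# Hypothetical data: the corpus has no separator at all between "pick" and "up"
corpus = "i can go pickup my car."
substring = "go pick up my car"
words = substring.split()

# "+" requires at least one non-letter character between consecutive words
strict = re.compile(r'[^a-z]+'.join(words), re.IGNORECASE)
# "*" also accepts words that touch each other directly
loose = re.compile(r'[^a-z]*'.join(words), re.IGNORECASE)

print(strict.search(corpus))  # no match, because "pickup" has no separator

m = loose.search(corpus)
if m:
    print(m.span(), repr(m.group(0)))
```

The looser pattern can also match across what were originally single words, so it is only worth using when separators really can disappear.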
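If you need to handle non-ASCII alphabetic characters, the regex joiner can be changed from r'[^a-z]+' to the equivalent r'[\W\d_]+' (which means "match all non-word characters [non-alphanumeric and non-underscore], plus digit characters and underscores"); it reads a little awkwardly, but it handles things like é correctly (treating it as part of a word rather than as a joiner).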
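The following sketch shows where the two joiners differ. The French sample text and the helper name build are invented for illustration. With the ASCII joiner, a word made up entirely of non-ASCII letters (here "à") is swallowed as if it were punctuation, so a substring that is missing that word still matches; the Unicode-aware joiner rejects it.

```python
import re

def build(substring, joiner):
    # Join the whitespace-separated words of `substring` with the given joiner
    return re.compile(joiner.join(substring.split()), re.IGNORECASE)

ASCII_JOINER = r'[^a-z]+'
UNICODE_JOINER = r'[\W\d_]+'

# Hypothetical sample data
corpus = "we drove to the café, à Paris"

# The substring contains every word: the Unicode-aware joiner finds it
good = build("café à Paris", UNICODE_JOINER).search(corpus)
print(repr(good.group(0)))

# The substring is missing the word "à": the ASCII joiner treats "à" as a
# separator and reports a match anyway, while the Unicode joiner does not
false_hit = build("café Paris", ASCII_JOINER).search(corpus)
no_hit = build("café Paris", UNICODE_JOINER).search(corpus)
print(false_hit is not None, no_hit is None)
```

This assumes both strings use the same Unicode normalization form; if they might not, normalizing both with unicodedata.normalize first would be a reasonable precaution.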
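This won't be as flexible as difflib, but when you know that no words were removed or added and the only differences are spacing and punctuation, it works perfectly and should run significantly faster than a true fuzzy matching solution (which has to do much more work to handle the concept of close matches).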