Getting regex groups with fuzzy matching

Question

Moh*_*ANI 8 python regex string fuzzy-search

I have a large list of words (about 200k):

["cat", "the dog", "elephant", "the angry tiger"]

I built this regex with fuzzy matching:

regex = "(cat){e<3}|(the dog){e<3}|(elephant){e<3}|(the angry tiger){e<3}"
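
As an illustration of what the `{e<3}` suffix asks for: in the third-party `regex` module it is understood to allow fewer than three errors, where an error is an insertion, a deletion, or a substitution. The sketch below counts those errors with a plain Levenshtein distance for the typos in the sample sentences. The names `edit_distance` and `pairs` are illustrative and not part of the original post. The `regex` module searches for a fuzzy match inside the sentence rather than comparing whole strings, so this only shows how the errors are counted.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Typos from the sample sentences, each paired with its list entry.
pairs = [("elephent", "elephant"), ("kat", "cat"), ("the doog", "the dog")]
distances = {typo: edit_distance(typo, word) for typo, word in pairs}
```

Each of these typos is one edit away from its list entry, which is within the limit that `{e<3}` sets.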

I have input sentences:

sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"
...

What I want to get is:

res1 = ["the dog"]
res2 = ["elephant", "cat"]

I tried this, for example:

re.findall(regex, sentence2, flags=re.IGNORECASE|re.UNICODE)

But it gave me:

["elephent", "kat"]

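Any idea how to get the right answer, with the correct words from my list? What I want is to get the regex capture group for each match, but I am struggling to do so.

Maybe I am not doing this right, and maybe regex is not a good approach, but checking "if item in list" inside a for loop takes far too long to run.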

Answer 1

wol*_*ats 3

This can be done by building the regex manually and naming the groups:

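import regex as re  # third-party "regex" module; the stdlib re has no fuzzy matching

a = ["cat", "the dog", "elephant", "the angry tiger"]
# Map each group name (g0, g1, ...) back to the list entry it was built from.
a_dict = {'g%d' % i: item for i, item in enumerate(a)}

# One named group per list entry, each with the same fuzzy constraint.
regex = "|".join([r"\b(?<g%d>(%s){e<3})\b" % (i, item) for i, item in enumerate(a)])

sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"

for match in re.finditer(regex, sentence2, flags=re.IGNORECASE|re.UNICODE):
    # Only the named group of the alternative that matched is not None.
    for key, value in match.groupdict().items():
        if value is not None:
            print("%s: %s" % (a_dict.get(key), value))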

elephant: elephent
cat: kat
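
The same group-to-word mapping can also be sketched with `match.lastgroup` instead of scanning `groupdict()`. This is a minimal sketch, not the answerer's code: the function name `find_listed_words` and its `engine` and `fuzzy` parameters are hypothetical. It defaults to the standard-library `re` with exact matching so that it runs without extra dependencies, and it assumes the third-party `regex` module mirrors `re` for `escape`, `finditer`, `IGNORECASE`, and `lastgroup`.

```python
import re


def find_listed_words(sentence, words, engine=re, fuzzy=""):
    """Return the list entries whose alternative matched, in order of appearance."""
    # One named group per entry; the group name encodes the entry's index.
    # The inner group is non-capturing, so the named group is the only
    # capturing group in each alternative.
    pattern = "|".join(
        r"\b(?P<g%d>(?:%s)%s)\b" % (i, engine.escape(word), fuzzy)
        for i, word in enumerate(words)
    )
    found = []
    for match in engine.finditer(pattern, sentence, flags=engine.IGNORECASE):
        # lastgroup names the group that matched, e.g. "g2".
        found.append(words[int(match.lastgroup[1:])])
    return found


words = ["cat", "the dog", "elephant", "the angry tiger"]
exact = find_listed_words("The elephant and the cat", words)
```

With the `regex` module installed, a fuzzy call would presumably look like `find_listed_words(sentence2, words, engine=regex, fuzzy="{e<3}")`. For a list of around 200k entries, a single alternation of this size may be slow to compile, so this is worth measuring before relying on it.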
