bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
smallString = "I_AM_HERE"
What efficient algorithm should I use to find the substring of "bigString" that closely matches "smallString"?
output = "I_AM_THERE"
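Compared with the small string, the output may contain a few insertions and deletions.
Edit: I found a good example that is very close to my problem: "How to add variable error to regex fuzzy search. Python"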
You can use the regex package (nearly ready for general use), which supports fuzzy matching:
>>> import regex
>>> bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
>>> regex.search('(?:I_AM_HERE){e<=1}',bigString).group(0)
'I_AM_THERE'
Or:
>>> bigString = "AGAH_I_AM_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_NOWHERE_EREXXMHHGRFS"
>>> print(regex.findall('I_AM_(?:HERE){e<=3}',bigString))
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
The regex module is a third-party package; at the time of writing it was hoped that it would become part of Python 3.4.
If you have pip, just type pip install regex or pip3 install regex to get it.
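If installing a third-party package is not an option, the same kind of approximate substring search can be sketched with plain dynamic programming. The block below is a minimal sketch of the edit-distance table in which a match may start at any position of the text for free (a variant usually attributed to Sellers). The function name `fuzzy_find` and its parameters are made up for this illustration and are not part of the regex package; insertions, deletions and substitutions each count as one error.

```python
def fuzzy_find(pattern, text, max_errors):
    """Return (distance, start, end) for the best approximate match of
    pattern inside text, or None if no match has <= max_errors edits.
    Hypothetical helper for illustration only."""
    m = len(pattern)
    # prev[i] = (edit distance, start index) for pattern[:i] matched
    # against the best substring of text ending at the previous position.
    prev = [(i, 0) for i in range(m + 1)]
    best = None
    for j, ch in enumerate(text, 1):
        # An empty pattern prefix matches the empty substring at j for free,
        # which is what lets a match begin anywhere in the text.
        cur = [(0, j)]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur.append(min(
                (prev[i - 1][0] + cost, prev[i - 1][1]),  # match or substitution
                (prev[i][0] + 1, prev[i][1]),             # extra character in text
                (cur[i - 1][0] + 1, cur[i - 1][1]),       # pattern character missing
            ))
        dist, start = cur[m]
        # Keep the first end position that reaches the lowest distance.
        if dist <= max_errors and (best is None or dist < best[0]):
            best = (dist, start, j)
        prev = cur
    return best


bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
result = fuzzy_find("I_AM_HERE", bigString, max_errors=2)
if result is not None:
    dist, start, end = result
    print(dist, bigString[start:end])
```

The running time is proportional to len(pattern) * len(text), so this is fine for strings of this size but will be slower than the compiled regex module on large inputs.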
In reply to the comment: "Is there a way to know the best out of the three in your second example? How to use BESTMATCH flag here?"
Use the best-match flag (?b) to get the single best match:
print(regex.search(r'(?b)I_AM_(?:ERE){e<=3}', bigString).group(0))
# I_AM_THE
Or combine it with difflib, or take the Levenshtein distance, over the list of all acceptable matches to the first literal:
import regex

def levenshtein(s1, s2):
    # Keep s1 as the shorter string so each row stays small.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[i] is the edit distance between s1[:i] and the prefix of s2 seen so far.
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

bigString = "AGAH_I_AM_NOWHERE_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_HERE_EREXXMHHGRFS"
# Pair every fuzzy match with its distance to the target string.
cl = [(levenshtein(s, 'I_AM_HERE'), s) for s in regex.findall('I_AM_(?:HERE){e<=3}', bigString)]
print(cl)
# Matches sorted from closest to farthest.
print([t[1] for t in sorted(cl, key=lambda t: t[0])])
# (?e) turns on ENHANCEMATCH, which tries to improve the fit of the fuzzy match found.
print(regex.search(r'(?e)I_AM_(?:ERE){e<=3}', bigString).group(0))
Prints:
[(3, 'I_AM_NOWHERE'), (1, 'I_AM_THERE'), (0, 'I_AM_HERE')]
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
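The difflib option mentioned in this answer can be sketched with the standard library alone. The block below assumes the candidate list has already been produced by a fuzzy search (it is hard-coded here from the output shown in this answer), and the helper name `similarity` is made up for the illustration. It ranks the candidates by `difflib.SequenceMatcher(...).ratio()`, where a higher ratio means a closer match, and also shows `difflib.get_close_matches` as a shortcut for picking the single closest one.

```python
import difflib

target = "I_AM_HERE"
# Assumed to come from an earlier fuzzy search, e.g. regex.findall(...).
candidates = ["I_AM_NOWHERE", "I_AM_THERE", "I_AM_HERE"]


def similarity(candidate):
    # ratio() is 2 * matched characters / total characters, in [0, 1].
    return difflib.SequenceMatcher(None, target, candidate).ratio()


# Closest candidate first.
ranked = sorted(candidates, key=similarity, reverse=True)
print(ranked)

# Shortcut: ask difflib directly for the single closest candidate.
best = difflib.get_close_matches(target, candidates, n=1)
print(best)
```

Note that the ratio is a similarity score rather than an edit distance, so the two measures can rank near-ties differently.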