Ale*_*nov 24 string algorithm search fuzzy-search
通过模糊匹配,我不是指Levenshtein距离或类似的相似字符串,而是它在TextMate/Ido/Icicles中的使用方式:给定一个字符串列表,找到包含搜索字符串中所有字符的字符串,但可能还有其他字符串之间的人物,更喜欢最合适.
Mat*_* M. 31
我终于明白了你在找什么.这个问题很有意思,但是看看你发现的两种算法似乎人们对规范有不同的看法;)
我认为更清楚地陈述问题和要求会很有用.
问题:
我们正在寻找一种加快输入速度的方法,允许用户只键入他们实际想要的关键字的几个字母,并为他们提供一个可供选择的列表.
分析:
前两个要求可以总结如下:对于输入,axg我们正在寻找与该正则表达式匹配的单词[^a]*a[^x]*x[^g]*g.*
第三个要求是故意松散的.单词应该出现在列表中的顺序需要保持一致......但是很难猜测评分方法是否优于字母顺序.如果列表是极端的,那么评分方法可能会更好,但是对于简短列表,眼睛更容易在列表中查找以明显方式排序的特定项目.
此外,字母顺序在打字过程中具有一致性的优点:即添加字母不会完全重新排序列表(对于眼睛和大脑来说很痛苦),它只是过滤掉不再匹配的项目.
关于处理unicode字符没有精确性,例如完全à类似于a或另一个字符?由于我不知道目前在关键字中使用这些字符的语言,我现在就让它滑倒.
我的解决方案
对于任何输入,我将构建前面表达的正则表达式.它适用于Python,因为该语言已经具有不区分大小写的匹配.
然后我会匹配我的(按字母顺序排序)关键字列表,并输出它以便过滤.
在伪代码中:
WORDS = ['Bar', 'Foo', 'FooBar', 'Other']
def GetList(input, words = WORDS):
expr = ['[^' + i + ']*' + i for i in input]
return [w for w in words if re.match(expr, w, re.IGNORECASE)]
Run Code Online (Sandbox Code Playgroud)
我本可以使用单行,但认为它会模糊代码;)
此解决方案适用于增量情况(即,当您匹配用户类型并因此保持重建时),因为当用户添加角色时,您可以简单地重新过滤刚刚计算的结果.从而:
我还应该注意,这个正则表达式不涉及反向跟踪,因此非常有效.它也可以建模为一个简单的状态机.
Levenshtein的"编辑距离"算法肯定会对你要做的事情起作用:它们会测量两个单词或地址或电话号码,诗篇,独白和学术文章之间的匹配程度,让你对结果并选择最佳匹配.
一个更轻量级的应用程序是计算常见的子串:它不如Levenshtein好,但它提供了可用的结果,并以慢速语言快速运行,可以访问快速的"InString"函数.
几年前我在Excellerando发表了Excel'模糊查找',使用'FuzzyMatchScore'功能,据我所知,确切地说你需要什么:
http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html
当然,它是在Visual Basic for Applications中.小心谨慎,十字架和大蒜:
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2
Dim s3 As String
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = 0
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Now clip out this substring from s1 and s2...
' And search the fragments before and after this excision:
If i > 1 And j > 1 Then
sBefore1 = left(s1, i - 1)
sBefore2 = left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = right(s1, n + 1 - i - len1)
sAfter2 = right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
Exit Function
End If
Next
Next
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
| 归档时间: |
|
| 查看次数: |
11640 次 |
| 最近记录: |