"模糊匹配"字符串的算法

Ale*_*nov 24 string algorithm search fuzzy-search

通过模糊匹配,我不是指Levenshtein距离或类似的相似字符串,而是它在TextMate/Ido/Icicles中的使用方式:给定一个字符串列表,找到包含搜索字符串中所有字符的字符串,但可能还有其他字符串之间的人物,更喜欢最合适.

Mat*_* M. 31

我终于明白了你在找什么.这个问题很有意思,但是看看你发现的两种算法似乎人们对规范有不同的看法;)

我认为更清楚地陈述问题和要求会很有用.

问题:

我们正在寻找一种加快输入速度的方法,允许用户只键入他们实际想要的关键字的几个字母,并为他们提供一个可供选择的列表.

  1. 期望输入的所有字母都在关键字中
  2. 期望输入中的字母在关键字中具有相同的顺序
  3. 返回的关键字列表应以一致(可再现)的顺序显示
  4. 该算法应该不区分大小写

分析:

前两个要求可以总结如下:对于输入,axg我们正在寻找与该正则表达式匹配的单词[^a]*a[^x]*x[^g]*g.*

第三个要求是故意松散的.单词应该出现在列表中的顺序需要保持一致......但是很难猜测评分方法是否优于字母顺序.如果列表是极端的,那么评分方法可能会更好,但是对于简短列表,眼睛更容易在列表中查找以明显方式排序的特定项目.

此外,字母顺序在打字过程中具有一致性的优点:即添加字母不会完全重新排序列表(对于眼睛和大脑来说很痛苦),它只是过滤掉不再匹配的项目.

关于处理unicode字符没有精确性,例如完全à类似于a或另一个字符?由于我不知道目前在关键字中使用这些字符的语言,我现在就让它滑倒.

我的解决方案

对于任何输入,我将构建前面表达的正则表达式.它适用于Python,因为该语言已经具有不区分大小写的匹配.

然后我会匹配我的(按字母顺序排序)关键字列表,并输出它以便过滤.

在伪代码中:

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words = WORDS):
  expr = ['[^' + i + ']*' + i for i in input]
  return [w for w in words if re.match(expr, w, re.IGNORECASE)]
Run Code Online (Sandbox Code Playgroud)

我本可以使用单行,但认为它会模糊代码;)

此解决方案适用于增量情况(即,当您匹配用户类型并因此保持重建时),因为当用户添加角色时,您可以简单地重新过滤刚刚计算的结果.从而:

  • 要么字符很少,所以匹配很快,列表的长度也无关紧要
  • 要么有很多字符,这意味着我们正在过滤一个短列表,因此如果匹配需要更长的元素,则无关紧要

我还应该注意,这个正则表达式不涉及反向跟踪,因此非常有效.它也可以建模为一个简单的状态机.


Nig*_*nan 9

Levenshtein的"编辑距离"算法肯定会对你要做的事情起作用:它们会测量两个单词或地址或电话号码,诗篇,独白和学术文章之间的匹配程度,让你对结果并选择最佳匹配.

一个更轻量级的应用程序是计算常见的子串:它不如Levenshtein好,但它提供了可用的结果,并以慢速语言快速运行,可以访问快速的"InString"函数.

几年前我在Excellerando发表了Excel'模糊查找',使用'FuzzyMatchScore'功能,据我所知,确切地说你需要什么:

http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html

当然,它是在Visual Basic for Applications中.小心谨慎,十字架和大蒜:

Public Function SumOfCommonStrings( _
                            ByVal s1 As String, _
                            ByVal s2 As String, _
                            Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
                            Optional iScore As Integer = 0 _
                                ) As Integer

Application.Volatile False

' N.Heffernan 06 June 2006 
' THIS CODE IS IN THE PUBLIC DOMAIN


' Function to measure how much of String 1 is made up of substrings found in String 2

' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched

' Watch out for strings of differing lengths:
'
'    SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
'     SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.


' This is coded for clarity, not for performance.

Dim arr() As Integer    ' Scoring matrix
Dim n As Integer        ' length of s1
Dim m As Integer        ' length of s2
Dim i As Integer        ' start position in s1
Dim j As Integer        ' start position in s2
Dim subs1 As String     ' a substring of s1
Dim len1 As Integer     ' length of subs1

Dim sBefore1            ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2

Dim s3 As String


SumOfCommonStrings = iScore

n = Len(s1)
m = Len(s2)

If s1 = s2 Then
    SumOfCommonStrings = n
    Exit Function
End If

If n = 0 Or m = 0 Then
    Exit Function
End If

's1 should always be the shorter of the two strings:
If n > m Then
    s3 = s2
    s2 = s1
    s1 = s3
    n = Len(s1)
    m = Len(s2)
End If

n = Len(s1)
m = Len(s2)

' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
    SumOfCommonStrings = n
    Exit Function
End If

For len1 = n To 1 Step -1

    For i = 1 To n - len1 + 1

        subs1 = Mid(s1, i, len1)
        j = 0
        j = InStr(1, s2, subs1, Compare)

        If j > 0 Then

            ' We've found a matching substring...
            iScore = iScore + len1            

          ' Now clip out this substring from s1 and s2...
          ' And search the fragments before and after this excision:


            If i > 1 And j > 1 Then
                sBefore1 = left(s1, i - 1)
                sBefore2 = left(s2, j - 1)
                iScore = SumOfCommonStrings(sBefore1, _
                                            sBefore2, _
                                            Compare, _
                                            iScore)
            End If


            If i + len1 < n And j + len1 < m Then
                sAfter1 = right(s1, n + 1 - i - len1)
                sAfter2 = right(s2, m + 1 - j - len1)
                iScore = SumOfCommonStrings(sAfter1, _
                                            sAfter2, _
                                            Compare, _
                                            iScore)
            End If


            SumOfCommonStrings = iScore
            Exit Function

        End If

    Next


Next


End Function


Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
Dim min As Integer

  min = a

  If b < min Then
        min = b
  End If

  If c < min Then
        min = c
  End If

  Minimum = min

End Function