Data comparison

wcm*_*wcm 5 sql-server textmatching

We have a SQL Server table containing company names, addresses, and contact names (among other fields).

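We regularly receive data files from external sources that we need to match against this table. Unfortunately, the data differs slightly because it comes from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example: we have "Acme, LLC" and the file contains "Acme Inc.". Yet another: we have "Ed Smith" and they have "Edward Smith".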
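To make the mismatches above concrete, here is a minimal VB.Net sketch of the kind of normalization these records would need before they compare equal. The abbreviation table, the suffix list, and the `Normalize` function are all hypothetical and only cover the three examples; this is not how the legacy system works.

```vbnet
Imports System
Imports System.Collections.Generic

Module NormalizeSketch
    ' Hypothetical abbreviation table; a real one would be far larger.
    Private ReadOnly Abbreviations As New Dictionary(Of String, String) From {
        {"e", "east"}, {"st", "street"}, {"ed", "edward"}
    }

    ' Hypothetical list of legal suffixes to ignore when comparing company names.
    Private ReadOnly CompanySuffixes As New HashSet(Of String) From {"llc", "inc"}

    ' Lower-cases the value, drops punctuation and company suffixes,
    ' and expands known abbreviations token by token.
    Public Function Normalize(ByVal value As String) As String
        Dim separators() As Char = {" "c, ","c, "."c}
        Dim tokens() As String = value.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries)
        Dim result As New List(Of String)

        For Each token As String In tokens
            If CompanySuffixes.Contains(token) Then Continue For
            Dim expanded As String = Nothing
            If Abbreviations.TryGetValue(token, expanded) Then
                result.Add(expanded)
            Else
                result.Add(token)
            End If
        Next

        Return String.Join(" ", result)
    End Function

    Sub Main()
        Console.WriteLine(Normalize("123 E. Main St."))                     ' prints "123 east main street"
        Console.WriteLine(Normalize("Acme, LLC") = Normalize("Acme Inc."))  ' prints True
    End Sub
End Module
```

A lookup table like this only catches variations someone has already listed, which is why it is usually paired with some form of approximate comparison.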

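We have a legacy system that uses some fairly complex, CPU-intensive methods to handle these matches. Some involve pure SQL, and others involve VBA code in an Access database. The current system is good but not perfect, and it is cumbersome and hard to maintain.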

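Management here wants to expand its use. The developers who will inherit support of the system would like to replace it with a more flexible solution that requires less maintenance.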

Is there a generally accepted way to handle this kind of data matching?

tor*_*ial 4

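Here is something I wrote for a nearly identical stack (we needed to standardize hardware manufacturer names, and there were all sorts of variations). It is client-side, though (VB.Net, to be exact), and uses the Levenshtein distance algorithm (modified for better results):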

    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = Integer.MaxValue 'Any real score will be lower than this

        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next

        Return bestMatch
    End Function

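    'Scores how similar two strings are (lower means more similar). Modified from the link below...
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
    'Needs Imports System.Threading and System.Reflection; WriteDebugMessage is a logging helper not shown here.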
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0

        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If

            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next

            'fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next

            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If
                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try

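        'The modification: subtract the number of matching character pairs counted while
        'filling the matrix. The result can be negative, so treat it as a ranking score,
        'not as a true edit distance.
        Return lvDistance - zeroCostCount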
    End Function
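For comparison with the modified version above, here is a minimal sketch of the unmodified, textbook Levenshtein distance. It is not the answer author's code: the module and function names are made up for illustration, and it keeps only two rows in memory instead of the full matrix.

```vbnet
Imports System

Module LevenshteinSketch
    ' Textbook Levenshtein distance using two rolling rows instead of a full matrix.
    Public Function StandardLevenshtein(ByVal s As String, ByVal t As String) As Integer
        If s.Length = 0 Then Return t.Length
        If t.Length = 0 Then Return s.Length

        Dim previous(t.Length) As Integer ' distances for the previous row
        Dim current(t.Length) As Integer  ' distances for the row being filled

        ' Turning an empty prefix of s into the first j characters of t takes j insertions.
        For j As Integer = 0 To t.Length
            previous(j) = j
        Next

        For i As Integer = 1 To s.Length
            current(0) = i ' deleting the first i characters of s
            For j As Integer = 1 To t.Length
                Dim cost As Integer = If(s(i - 1) = t(j - 1), 0, 1)
                current(j) = Math.Min(previous(j - 1) + cost,
                                      Math.Min(previous(j), current(j - 1)) + 1)
            Next
            ' Swap the rows so "current" becomes "previous" for the next pass.
            Dim swap() As Integer = previous
            previous = current
            current = swap
        Next

        Return previous(t.Length)
    End Function

    Sub Main()
        Console.WriteLine(StandardLevenshtein("kitten", "sitting")) ' prints 3
        Console.WriteLine(StandardLevenshtein("Acme, LLC", "Acme Inc."))
    End Sub
End Module
```

Unlike the modified function, this value is never negative and is zero only for identical strings, so it can be divided by the longer string's length when a threshold between 0 and 1 is more convenient than a raw count.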