wcm*_*wcm 5 sql-server textmatching
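We have a SQL Server table containing company name, address, and contact name (among other fields).
We regularly receive data files from outside sources that we have to match against this table. Unfortunately, the data is slightly different because it comes from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example: we have "Acme, LLC" and the file contains "Acme Inc.". Another: we have "Ed Smith" and they have "Edward Smith".
We have a legacy system that uses some rather intricate, CPU-intensive methods to handle these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is good but not perfect, and it is cumbersome and difficult to maintain.
Management here wants to expand its use. The developers who will inherit support of the system want to replace it with a more agile solution that requires less maintenance.
Is there a commonly accepted way of dealing with this kind of data matching?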
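Here's something I wrote for a nearly identical stack (we needed to standardize the manufacturer names for hardware, and there were all sorts of variations). This is client-side, though (VB.Net, to be exact), and it uses the Levenshtein distance algorithm (modified for better results):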
Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
    Dim bestMatch As String = ""
    ' Start at the largest possible value so the first candidate always wins.
    Dim bestDistance As Integer = Integer.MaxValue
    For Each matchCandidate As String In stringList
        Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
        ' Lower scores mean more similar; keep the lowest seen so far.
        If candidateDistance < bestDistance Then
            bestMatch = matchCandidate
            bestDistance = candidateDistance
        End If
    Next
    ' Returns "" only when stringList is empty.
    Return bestMatch
End Function
' This will be used to determine how similar strings are. Modified from the link below...
' Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
' Needs at file level: Imports System.Reflection and Imports System.Threading.
Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
    Dim sLength As Integer = s.Length ' length of s
    Dim tLength As Integer = t.Length ' length of t
    Dim lvCost As Integer ' substitution cost for the current character pair
    Dim lvDistance As Integer = 0
    Dim zeroCostCount As Integer = 0 ' number of matching character pairs seen
    Try
        ' If either string is empty, the distance is the length of the other.
        If tLength = 0 Then
            Return sLength
        ElseIf sLength = 0 Then
            Return tLength
        End If
        ' The (tLength + 1) x (sLength + 1) matrix is stored as one flat array,
        ' one row of (sLength + 1) cells per character position in t.
        Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
        Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}
        ' fill first row
        For lvIndex As Integer = 0 To sLength
            poBuffer(lvIndex) = lvIndex
        Next
        ' fill first column
        For lvIndex As Integer = 1 To tLength
            poBuffer(lvIndex * (sLength + 1)) = lvIndex
        Next
        For lvRowIndex As Integer = 0 To sLength - 1
            Dim s_i As Char = s(lvRowIndex)
            For lvColIndex As Integer = 0 To tLength - 1
                If s_i = t(lvColIndex) Then
                    lvCost = 0
                    zeroCostCount += 1
                Else
                    lvCost = 1
                End If
                ' Take the cheapest of substitution (top-left), insertion and deletion.
                Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
            Next
        Next
    Catch ex As ThreadAbortException
        Err.Clear()
    Catch ex As Exception
        ' WriteDebugMessage is a logging helper from my own project (not shown here).
        WriteDebugMessage(Application.StartupPath, [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
    End Try
    ' The modification: subtract the number of matching character pairs.
    ' The result is a similarity score, not a true edit distance, and can be negative.
    Return lvDistance - zeroCostCount
End Function
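
For comparison, here is a minimal sketch of the unmodified Levenshtein distance, written in Python rather than VB.Net to keep it short. It is not a port of the function above: it leaves out the zero-cost adjustment and instead scales the distance by the length of the longer string. The names `similarity` and `find_most_similar` and the `min_similarity` cutoff of 0.6 are assumptions made for illustration. The cutoff is there because a "pick the closest" loop like `FindMostSimilarString` returns some candidate whenever the list is non-empty, even when nothing in it is a plausible match.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic edit distance, keeping only two rows of the matrix."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """Scale the distance to the range 0.0 (all edits) .. 1.0 (identical)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


def find_most_similar(to_find, candidates, min_similarity=0.6):
    """Return the closest candidate, or None if none reaches the cutoff."""
    best, best_score = None, -1.0
    for candidate in candidates:
        score = similarity(to_find.lower(), candidate.lower())
        if score >= min_similarity and score > best_score:
            best, best_score = candidate, score
    return best
```

Calling `find_most_similar("Hewlet Packard", ["Hewlett-Packard", "Dell", "Lenovo"])` picks `"Hewlett-Packard"`, while an input that resembles none of the candidates yields `None` and can be routed to manual review. The right cutoff depends on the data and would need tuning against known matches.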
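
Edit distance alone will not bring "Ed Smith" close to "Edward Smith" or "Acme, LLC" close to "Acme Inc.", so it may help to normalize both sides before comparing. The Python sketch below shows one way that could look. The lookup tables are tiny, hypothetical samples made up for this example, and the function names are likewise assumptions. A real table would need care: "St" can mean "Saint" as well as "Street", and "Acme, LLC" and "Acme Inc." may be different legal entities.

```python
import re

# Hypothetical sample tables; a real system would need far larger ones.
ABBREVIATIONS = {"e": "east", "w": "west", "n": "north", "s": "south",
                 "st": "street", "ave": "avenue", "rd": "road"}
LEGAL_SUFFIXES = {"llc", "inc", "corp", "ltd", "co"}
NICKNAMES = {"ed": "edward", "bob": "robert", "bill": "william"}


def tokens(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize_address(text):
    """Expand known abbreviations, e.g. "E." becomes "east"."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens(text))


def normalize_company(text):
    """Drop legal-form suffixes such as "LLC" or "Inc."."""
    return " ".join(tok for tok in tokens(text) if tok not in LEGAL_SUFFIXES)


def normalize_person(text):
    """Replace known nicknames with the full given name."""
    return " ".join(NICKNAMES.get(tok, tok) for tok in tokens(text))
```

With these sample tables, each of the three pairs from the question normalizes to the same string on both sides, so an exact comparison is enough for them. Anything that still differs after normalization can then go through a fuzzy comparison such as Levenshtein distance.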