我现在正在尝试不同的技术,将头撞在墙上一段时间.
他们都没有运作良好.
我有两个字符串.
我需要比较它们并获得确切的匹配百分比,
即."四分和七年前"为了"scor and sevn yeres ago"
好吧,我首先将每个单词与每个单词进行比较,跟踪每个单词,以及百分比= count \numOfWords.不,没有考虑拼写错误的单词.
("四"<>"为"即使它很近)
然后我开始尝试比较每个char中的每个char,如果不匹配则递增字符串char(计算拼写错误).但是,我会得到错误的命中,因为第一个字符串可能在第二个字符串中包含每个字符,但不是第二个字符串的确切顺序.("东西可用"<>"stu vail"(但它会回来,低百分比,但是命中.9\11 = 81%))
所以,我然后尝试比较每个字符串中的字符对.如果string1 [i] = string2 [k] AND string1 [i + 1] = string2 [k + 1],则递增计数,并在不匹配时递增"k"(以跟踪误导."for"和"四分"应该以75%的命中率回归.)这似乎也不起作用.它越来越近,但即使完全匹配,它也只会返回94%.当事情真的拼写错误时,它真的搞砸了.(底部代码)
有什么想法或指示吗?
码
count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
' To ignore non letters or digits '
If Not strTempName(j).IsLetter(strTempName(j)) Then
j += 1
End If
' To ignore non letters or digits '
If Not strTempFile(k).IsLetter(strTempFile(k)) Then
k += 1
End If
' compare pair of chars '
While (strTempName(j) <> strTempFile(k) And _
strTempName(j + 1) <> strTempFile(k + 1) And _
k < strTempFile.Length - 2)
k += 1
End While
count += 1
j += 1
k += 1
End While
perc = count / (strTempName.Length - 1)
Run Code Online (Sandbox Code Playgroud)
编辑:我一直在做一些研究,我想我最初从这里找到了代码,并在几年前将其翻译为vbnet.它使用Levenshtein字符串匹配算法.
这是我用于此的代码,希望它有所帮助:
Sub Main()
Dim string1 As String = "four score and seven years ago"
Dim string2 As String = "for scor and sevn yeres ago"
Dim similarity As Single =
GetSimilarity(string1, string2)
' RESULT : 0.8
End Sub
Public Function GetSimilarity(string1 As String, string2 As String) As Single
Dim dis As Single = ComputeDistance(string1, string2)
Dim maxLen As Single = string1.Length
If maxLen < string2.Length Then
maxLen = string2.Length
End If
If maxLen = 0.0F Then
Return 1.0F
Else
Return 1.0F - dis / maxLen
End If
End Function
Private Function ComputeDistance(s As String, t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim distance As Integer(,) = New Integer(n, m) {}
' matrix
Dim cost As Integer = 0
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
'init1
Dim i As Integer = 0
While i <= n
distance(i, 0) = System.Math.Max(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Max(System.Threading.Interlocked.Increment(j), j - 1)
End While
'find min distance
For i = 1 To n
For j = 1 To m
cost = (If(t.Substring(j - 1, 1) = s.Substring(i - 1, 1), 0, 1))
distance(i, j) = Math.Min(distance(i - 1, j) + 1, Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost))
Next
Next
Return distance(n, m)
End Function
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
4222 次 |
| 最近记录: |