sur*_*gle 6 sql-server compare soundex levenshtein-distance
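In my application I need to identify a person by searching for their last name and first name. One requirement is to tolerate misspellings to a certain degree.

I am trying to identify a person by first name and last name:

The screenshot contains some test records and the result of my SQL query, including the Soundex value of each column and the Levenshtein distance (LD).

My current query looks like this: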

    SELECT t2.*
         , t1.Firstname + ' ' + t1.Lastname AS SourceName
         , 'Torsten Mueller' AS TargetName
           -- third argument (8) is the maximum distance; the function returns -1 above it
         , dbo.FUNC_LEVENSHTEIN(t1.Firstname + ' ' + t1.Lastname
                              , 'Torsten Mueller', 8) AS LEVENSHTEIN_Distance
      FROM #TestSoundex t1
      LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
        -- pre-filter on the stored Soundex codes before looking at the distance
     WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
       AND t1.Soundex_Lastname = SOUNDEX('Mueller')

As you can see, I first filter the results by Soundex and then calculate the Levenshtein distance for the remaining records. In the example below, the Levenshtein distance ranges from 0 (both strings are equal) to 3.

    SourceName       | TargetName      | Levenshtein Distance
    Thorsten Müller  | Torsten Mueller | 3
    Torsten Müller   | Torsten Mueller | 2
    Thorsten Mueller | Torsten Mueller | 1
    Torsten Mueller  | Torsten Mueller | 0

The calculation of the distance is explained in a lecture by a Stanford professor:

    I N T E * N TION
    | | | | | | |
    * E X E C U TION
    d s s   i s

Each deletion (d) or insertion (i) adds 1 point, and each substitution (s) adds 2 points.
The LD function I use returns 5 points for the example above, but it returns only 3 instead of 4 for the distance between Thorsten Müller and Torsten Mueller:

    +1 point to delete h
    +1 point instead of 2 to substitute ü
    +1 point to insert e

So I added some examples.
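
My impression is that neither Soundex nor LD is sufficient to uniquely identify a given person record by firstname and lastname, considering that spelling mismatches may occur.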
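
The two scoring schemes mentioned above can be compared outside SQL Server. What follows is a minimal Python sketch, not the T-SQL function itself; `edit_distance` and its `sub_cost` parameter are hypothetical names introduced only for illustration. With a substitution cost of 1 it reproduces the values 5 and 3 reported above, and with a substitution cost of 2 (the lecture's scoring) it gives 8 and 4, which suggests the function in use counts a substitution as 1 point.

```python
def edit_distance(s: str, t: str, sub_cost: int = 1) -> int:
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s to each prefix of t
    for i, sc in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, tc in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                                   # delete sc
                curr[j - 1] + 1,                               # insert tc
                prev[j - 1] + (0 if sc == tc else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


pairs = [("intention", "execution"), ("Thorsten Müller", "Torsten Mueller")]
for s, t in pairs:
    print(f"{s} -> {t}: "
          f"substitution=1 gives {edit_distance(s, t, 1)}, "
          f"substitution=2 gives {edit_distance(s, t, 2)}")
# intention -> execution: substitution=1 gives 5, substitution=2 gives 8
# Thorsten Müller -> Torsten Mueller: substitution=1 gives 3, substitution=2 gives 4
```

Python compares characters by code point, so `ü` and `u` always count as different here. In T-SQL the `=` comparison follows the collation in effect, so results for accented characters may differ under an accent-insensitive collation.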
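
Can someone explain how the distance is calculated for umlauts such as ü, ö, ä, so that I can better understand the calculation?
What is the maximum distance for finding the correct match of first name and last name given strings s and t? Should it be based on the length of both strings, i.e. numberOrCharacters(s+t)/2 = max?

This is the function I use, taken from the linked answer. I only changed the function name from edit_distance_within to FUNC_LEVENSHTEIN: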

    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_NULLS ON
    GO

    -- Returns the edit distance between @s and @t, or -1 if it exceeds @d.
    CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
    RETURNS int
    AS
    BEGIN
      DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
        @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
      SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
      -- @cv1 holds the previous matrix row, one cell per character code
      WHILE @j <= @tl
        SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
      WHILE @i <= @sl
      BEGIN
        SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
        WHILE @j <= @tl
        BEGIN
          -- insertion: cell to the left + 1
          SET @c = @c + 1
          -- match or substitution: diagonal cell + 0 or + 1
          SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
          IF @c > @c1 SET @c = @c1
          -- deletion: cell above + 1
          SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
          IF @c > @c1 SET @c = @c1
          -- track the smallest value in the current row
          IF @c < @cmin SET @cmin = @c
          SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
        END
        -- stop early once the whole row is above the limit
        IF @cmin > @d BREAK
        SELECT @cv1 = @cv0, @i = @i + 1
      END
      RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
    END
    GO

Here is another test:

    CREATE TABLE #TestLevenshteinDistance(
        Id int IDENTITY(1,1) NOT NULL,
        SourceName nvarchar(100) NULL,
        Soundex_SourceName varchar(4) NULL,
        Targetname nvarchar(100) NULL,
        Soundex_TargetName varchar(4) NULL
    );

    INSERT INTO #TestLevenshteinDistance
        ( SourceName,
          Soundex_SourceName,
          Targetname,
          Soundex_TargetName)
    VALUES
        ('Intention', SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),
        ('Karsten', SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten'));

    SELECT t1.*
         , dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) AS LEVENSHTEIN_Distance
      FROM #TestLevenshteinDistance t1
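
On the questions about umlauts and a maximum distance raised earlier, one possible approach is to fold umlauts into their two-letter transliterations before comparing, and to derive the cut-off from the average length of both strings. The Python sketch below illustrates that idea under stated assumptions: the names `fold`, `levenshtein` and `is_match`, the transliteration table, and the ratio of 0.25 are all illustrative choices, not values from the post or from SQL Server.

```python
# Assumed transliteration table for German umlauts (illustrative, not exhaustive).
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                         "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})


def fold(name: str) -> str:
    """Transliterate umlauts and ignore case before comparing."""
    return name.translate(UMLAUTS).casefold()


def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete
                            curr[j - 1] + 1,           # insert
                            prev[j - 1] + (sc != tc)))  # match or substitute
        prev = curr
    return prev[-1]


def is_match(s: str, t: str, max_ratio: float = 0.25) -> bool:
    """Accept the pair if the distance is at most max_ratio times the average length.

    max_ratio = 0.25 is an assumed value; it would need tuning on real data.
    """
    a, b = fold(s), fold(t)
    limit = int(max_ratio * (len(a) + len(b)) / 2)
    return levenshtein(a, b) <= limit


print(levenshtein(fold("Thorsten Müller"), fold("Torsten Mueller")))  # 1
print(is_match("Thorsten Müller", "Torsten Mueller"))                 # True
print(is_match("Karsten", "Torsten"))                                 # False
print(is_match("Intention", "Execution"))                             # False
```

After folding, only the extra `h` separates Thorsten Müller from Torsten Mueller, while Karsten and Torsten (distance 2 on 7 characters) fall outside the assumed limit. A fixed limit such as the 8 passed to FUNC_LEVENSHTEIN above would accept both pairs.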