Levenshtein在T-SQL中的距离

Question

Levenshtein在T-SQL中的距离

Ale*_*yev 76 t-sql edit-distance levenshtein-distance

我对计算Levenshtein距离的T-SQL算法感兴趣.

Answer 1

我在TSQL中实现了标准的Levenshtein编辑距离函数,并进行了几次优化,提高了我所知道的其他版本的速度.在两个字符串在其开始处具有共同字符的情况下(共享前缀),在它们的末尾共享字符(共享后缀),并且当字符串很大并且提供最大编辑距离时,速度的提高是显着的.例如,当输入是两个非常相似的4000个字符串,并且指定了最大编辑距离2时,这几乎比edit_distance_within在接受的答案中起作用,在0.073秒(73毫秒)和55秒之间返回答案.它还具有内存效率,使用的空间等于两个输入字符串中较大的一个加上一些常量空间.它使用表示列的单个nvarchar"数组",并在其中进行所有计算,以及一些辅助int变量.

优化:

跳过共享前缀和/或后缀的处理
如果较大的字符串以较小的字符串开头或结尾,则提前返回
如果尺寸差异保证最大距离将被超过,则提前返回
仅使用表示矩阵中的列的单个数组(实现为nvarchar)
当给出最大距离时,时间复杂度从(len1*len2)变为(min(len1,len2))即线性
当给出最大距离时,一旦知道最大距离界限就不能实现早期返回

我在TSQL中关于Levenshtein的博客文章中更详细地描述了优化,并将链接链接到另一个具有类似Damerau-Levenshtein实现的帖子.但这里是代码(2014年1月20日更新,以加快它的速度):

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                END    END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

Run Code Online (Sandbox Code Playgroud)

正如此函数的注释中所提到的,字符比较的区分大小写将遵循有效的排序规则.默认情况下,SQL Server的排序规则会导致不区分大小写的比较.修改此函数以始终区分大小写的一种方法是将特定的排序规则添加到比较字符串的两个位置.但是,我没有对此进行全面测试,尤其是在数据库使用非默认排序规则时的副作用.以下是如何更改这两行以强制进行区分大小写的比较:

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)

Run Code Online (Sandbox Code Playgroud)

和

            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag                    --match, no change

Run Code Online (Sandbox Code Playgroud)

Answer 2

Ale*_*yev 57

阿诺德Fribble对两个提案sqlteam.com/forums

一个从2005年6月和
另一个更新的2006年5月

这是2006年的年轻人:

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO

CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
    WHILE @j <= @tl
    BEGIN
      SET @c = @c + 1
      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
      IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c
      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
    END
    IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO

Run Code Online (Sandbox Code Playgroud)

@Lieven:这不是我的实施,作者是Arnold Fribble.@d参数是字符串之间的最大允许差异,它们被认为太多样化并且函数返回-1.这是因为T-SQL中的算法运行得太慢了. (2认同)

Answer 3

Lea*_*pez 13

IIRC,使用SQL Server 2005及更高版本,您可以使用任何.NET语言编写存储过程:在SQL Server 2005中使用CLR集成.有了它,编写一个计算Levenstein距离的程序应该不难.

一个简单的Hello,World!从帮助中提取:

using System;
using System.Data;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public class HelloWorldProc
{
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void HelloWorld(out string text)
    {
        SqlContext.Pipe.Send("Hello world!" + Environment.NewLine);
        text = "Hello world!";
    }
}

Run Code Online (Sandbox Code Playgroud)

然后在SQL Server中运行以下命令:

CREATE ASSEMBLY helloworld from 'c:\helloworld.dll' WITH PERMISSION_SET = SAFE

CREATE PROCEDURE hello
@i nchar(25) OUTPUT
AS
EXTERNAL NAME helloworld.HelloWorldProc.HelloWorld

Run Code Online (Sandbox Code Playgroud)

现在你可以测试运行它:

DECLARE @J nchar(25)
EXEC hello @J out
PRINT @J

Run Code Online (Sandbox Code Playgroud)

希望这可以帮助.

Answer 4

Era*_*per 7

您可以使用Levenshtein距离算法来比较字符串

在这里,您可以在http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx找到T-SQL示例.

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

Run Code Online (Sandbox Code Playgroud)

(Joseph Gama开发的功能)

用法:

select
 dbo.edit_distance('Fuzzy String Match','fuzzy string match'),
 dbo.edit_distance('fuzzy','fuzy'),
 dbo.edit_distance('Fuzzy String Match','fuzy string match'),
 dbo.edit_distance('levenshtein distance sql','levenshtein sql server'),
 dbo.edit_distance('distance','server')

Run Code Online (Sandbox Code Playgroud)

该算法简单地返回stpe计数,通过一步替换不同的字符将一个字符串更改为另一个字符串

归档时间：	16 年，11 月前
查看次数：	58839 次
最近记录：	6 年，3 月前