How can I measure the similarity between 2 strings?

Zan*_*oni 55 c# string comparison phonetics

Given two strings, text1 and text2:

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

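Examples:

  1. First string: StackOverflow

    Second string: StaqOverflow

    Return: similarity is 91%

    The return value can be a percentage or something like that.

  2. First string: The simple text test

    Second string: The complex text test

    Return: the values can be considered equal

Any ideas? What is the best way to do this?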

Jon*_*eet 42

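There are various ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take pronunciation into account, though, so "staq overflow" would score as similar to "stack overflow" as "staw overflow" does, even though the first is closer in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.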

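  • Just FYI, if you're working with data in SQL Server, it has a SOUNDEX() function. (8 upvotes)
  • Also, it should be noted that Soundex is an old algorithm designed primarily for English words. If you need a modern, multilingual algorithm, consider Metaphone. This article discusses the differences in more detail: http://www.informit.com/articles/article.aspx?p=1848528 (2 upvotes)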
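
To make the phonetic idea concrete, here is a minimal sketch of the classic American Soundex encoding in C#. The class name `SoundexSketch` and its `Encode` method are hypothetical names chosen for illustration, not part of any library mentioned in this thread, and the sketch assumes plain ASCII letters (anything else is skipped).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; assumes ASCII input.
public static class SoundexSketch
{
    // Digit codes for A..Z; '0' marks letters that carry no code (vowels, H, W, Y).
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        var result = new StringBuilder();
        char lastCode = '0';

        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // ignore anything that is not an ASCII letter

            char code = Codes[raw - 'A'];

            if (result.Length == 0)
            {
                // The first letter is kept verbatim, but its code still suppresses a repeat.
                result.Append(raw);
                lastCode = code;
                continue;
            }

            if (raw == 'H' || raw == 'W') continue; // H and W do not separate equal codes

            if (code != '0' && code != lastCode) result.Append(code);
            lastCode = code; // a vowel resets the run, so equal codes around it count twice

            if (result.Length == 4) break;
        }

        // Short results are padded with zeros to the fixed length of four.
        return result.ToString().PadRight(4, '0');
    }
}
```

Under this sketch, `SoundexSketch.Encode("StackOverflow")` and `SoundexSketch.Encode("StaqOverflow")` produce the same code, while `"StawOverflow"` produces a different one, which matches the pronunciation point made in the answer. Comparing codes only gives a yes/no answer, not a percentage.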

Lir*_*una 27

Levenshtein distance is probably what you're looking for.
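
One possible way to turn an edit distance into the percentage the question asks for is to divide by the length of the longer string. The sketch below assumes that normalization; `SimilaritySketch`, `Distance`, and `Percent` are hypothetical names, and other normalizations are equally defensible.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SimilaritySketch
{
    // Levenshtein distance using two rolling rows instead of a full matrix.
    public static int Distance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Row 0: building a prefix of b from the empty string takes one insertion per character.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The row just filled becomes the "previous" row of the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Assumed normalization: 100% for identical strings, 0% when every position differs.
    public static double Percent(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 100.0; // two empty strings are treated as identical
        return (1.0 - (double)Distance(a, b) / longest) * 100.0;
    }
}
```

For the question's first pair, `Distance("StackOverflow", "StaqOverflow")` is 2 (one substitution and one deletion), so this normalization yields roughly 84.6% rather than the 91% quoted in the question.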


Thu*_*rGr 14

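Here is some code I wrote for a project I am working on. I need to know both the similarity ratio of the strings and a similarity ratio based on the words in the strings. For the latter, I want the word similarity ratio relative to the shorter string (so if all of its words exist and match in the longer string, the result is 100%) as well as the word similarity ratio relative to the longer string (which I call RealWordsRatio). I use the Levenshtein algorithm to find the distance. The code is not optimized yet, but it works as expected. I hope you find it useful.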

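public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;

        // If either string is empty, the distance is the length of the other.
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // d[i, j] holds the edit distance between the first i chars of s and the first j chars of t.
        int[,] d = new int[n + 1, m + 1];

        // Turning a prefix into the empty string takes one deletion per character.
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }

        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                // A substitution is free when the characters already match.
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Take the cheapest of deletion, insertion, and substitution.
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // The bottom-right cell is the distance between the full strings.
        return d[n, m];
    }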

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
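        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length == 0 || Splitted2.Length == 0)
        {
            //Nothing to compare: avoids indexing with BestWord == -1 and dividing by zero below.
            WordsRatio = 0;
            RealWordsRatio = 0;
            return 0;
        }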
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = Math.Max(FullString1.Replace(" ", "").Length, FullString2.Replace(" ", "").Length);
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }


ane*_*son 5

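I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.

Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are usually spelled similarly too. I also did an analysis of various fuzzy matching algorithms which you might find useful.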