从字符串匹配的英语词典单词

chr*_*hrr 5 .net c# linq

我试图弄清楚从字典文件到给定字符串中识别英语单词的最佳匹配问题.

例如("lines"是字典单词列表):

string testStr = "cakeday";

for (int x= 0; x<= testStr.Length; x++)
{
  string test = testStr.Substring(x);

   if (test.Length > 0)
   {
      string test2 = testStr.Remove(counter);
      int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
      Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
    }
}
Run Code Online (Sandbox Code Playgroud)

给出输出:

Test: cakeday /   : 0
Test: akeday / c  : 1
Test: keday / ca  : 0
Test: eday / cak  : 0
Test: day / cake  : 2
Test: ay / caked  : 1
Test: y / cakeda  : 1
Run Code Online (Sandbox Code Playgroud)

很明显,"day/cake"是最适合字符串的,但是如果我要在字符串中引入第三个单词,例如"cakedaynow"它就不能很好地工作.

我知道这个例子是原始的,它更像是一个概念证明,并且想知道是否有人对这种类型的字符串分析有任何经验?

谢谢!

Jam*_*See 2

您需要研究适合您想要做的事情的算法类别。从维基百科上的近似字符串匹配开始。

另外,这里有一个用 C# 编写的 Levenshtein 编辑距离实现,可以帮助您入门:

using System;

namespace StringMatching
{
    /// <summary>
    /// A class to extend the string type with a method to get Levenshtein Edit Distance.
    /// </summary>
    public static class LevenshteinDistanceStringExtension
    {
        /// <summary>
        /// Get the Levenshtein Edit Distance.
        /// </summary>
        /// <param name="strA">The current string.</param>
        /// <param name="strB">The string to determine the distance from.</param>
        /// <returns>The Levenshtein Edit Distance.</returns>
        public static int GetLevenshteinDistance(this string strA, string strB)
        {
            if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
                return 0;

            if (string.IsNullOrEmpty(strA))
                return strB.Length;

            if (string.IsNullOrEmpty(strB))
                return strA.Length;

            int[,] deltas; // matrix
            int lengthA;
            int lengthB;
            int indexA;
            int indexB;
            char charA;
            char charB;
            int cost; // cost

            // Step 1
            lengthA = strA.Length;
            lengthB = strB.Length;

            deltas = new int[lengthA + 1, lengthB + 1];

            // Step 2
            for (indexA = 0; indexA <= lengthA; indexA++)
            {
                deltas[indexA, 0] = indexA;
            }

            for (indexB = 0; indexB <= lengthB; indexB++)
            {
                deltas[0, indexB] = indexB;
            }

            // Step 3
            for (indexA = 1; indexA <= lengthA; indexA++)
            {
                charA = strA[indexA - 1];

                // Step 4
                for (indexB = 1; indexB <= lengthB; indexB++)
                {
                    charB = strB[indexB - 1];

                    // Step 5
                    if (charA == charB)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }

                    // Step 6
                    deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
                }
            }

            // Step 7
            return deltas[lengthA, lengthB];
        }
    }
}
Run Code Online (Sandbox Code Playgroud)