Finding phrases that are used multiple times in a string

pow*_*tte 0 c# algorithm text

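Given a text file, it is easy to count word occurrences with a dictionary and identify the most frequently used words. But how can I find commonly used phrases, where a "phrase" is a set of two or more consecutive words?

For example, here is some sample text:

Except oral wills, every will shall be in writing, but may be handwritten or typewritten. The will shall contain the testator's signature or by some other person in the testator's conscious presence and at the testator's express direction. The will shall be attested and subscribed in the conscious presence of the testator, by two or more competent witnesses, who saw the testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the testator's senses, excluding the sense of sight or sound that is sensed by telephonic, electronic, or other distant communication.

How can I identify that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) appear more than once (other than by brute-force searching for every set of two or three words)?

I'll be writing this in C#, so C# code would be great, but I can't even identify a good algorithm, so I'll settle for any code or even pseudocode that solves this.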

gun*_*171 5

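Try this. It's by no means foolproof, but it should get the job done for now.

Yes, this only matches two-word combinations, doesn't strip punctuation, and is brute force. No, the ToList isn't necessary.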

string text = "that big long text block";

var splitBySpace = text.Split(' ');

var doubleWords = splitBySpace
    .Select((x, i) => new { Value = x, Index = i })
    .Where(x => x.Index != splitBySpace.Length - 1)
    .Select(x => x.Value + " " + splitBySpace.ElementAt(x.Index + 1)).ToList();

var duplicates = doubleWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

I got the following results:

[image: screenshot of the results]
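The punctuation caveat mentioned above can be handled by tokenizing the text before pairing words. Here is a minimal sketch, assuming that a "word" is a run of letters, digits, or apostrophes and that case should be ignored. The names `BigramSketch`, `Tokenize`, and `CountRepeatedBigrams` are made up for illustration and are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BigramSketch
{
    // Assumption: a word is a run of letters, digits, or apostrophes; case is ignored.
    public static string[] Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{N}']+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    // Returns every adjacent word pair that occurs more than once, with its count.
    public static Dictionary<string, int> CountRepeatedBigrams(string text)
    {
        string[] words = Tokenize(text);
        return Enumerable.Range(0, Math.Max(0, words.Length - 1))
            .Select(i => words[i] + " " + words[i + 1])
            .GroupBy(pair => pair)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

For the input `"the cat sat. The cat ran!"` this reports a single repeated pair, `the cat`, with a count of 2. Note that pairs are still formed across sentence boundaries (`sat the` in that input), which may or may not be what you want.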


Here is my attempt at getting combinations of more than 2 words. Again, the same caveats as before apply.

List<string> multiWords = new List<string>();

//i is the number of words to combine
//in this case, 2-6 words
for (int i = 2; i <= 6; i++)
{
    // Only start a phrase where i full words remain, so no truncated phrases are produced.
    multiWords.AddRange(splitBySpace
        .Select((x, index) => new { Value = x, Index = index })
        .Where(x => x.Index <= splitBySpace.Length - i)
        .Select(x => CombineItems(splitBySpace, x.Index, x.Index + i - 1)));
}

var duplicates = multiWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

private string CombineItems(IEnumerable<string> source, int startIndex, int endIndex)
{
    // Join the words from startIndex through endIndex (inclusive) with single spaces.
    return string.Join(" ", source.Skip(startIndex).Take(endIndex - startIndex + 1).ToArray());
}

The results this time:

[image: screenshot of the results]

Now, I just want to say that there is very likely a bug in my code. I haven't fully tested it, so make sure you test it before you use it.
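For comparison, the same idea can be written with a dictionary of counts instead of building a list and grouping it afterwards. This is only a sketch under assumptions: the input is already split into words, and the names `NGramSketch` and `CountRepeatedPhrases` are hypothetical, not taken from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NGramSketch
{
    // Counts every run of minLen..maxLen consecutive words and keeps those seen more than once.
    public static Dictionary<string, int> CountRepeatedPhrases(
        string[] words, int minLen, int maxLen)
    {
        var counts = new Dictionary<string, int>();
        for (int len = minLen; len <= maxLen; len++)
        {
            // Stop at the last index where a full phrase of 'len' words still fits.
            for (int start = 0; start + len <= words.Length; start++)
            {
                string phrase = string.Join(" ", words, start, len);
                int seen;
                counts.TryGetValue(phrase, out seen); // seen stays 0 for a new phrase
                counts[phrase] = seen + 1;
            }
        }
        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Calling `NGramSketch.CountRepeatedPhrases(new[] { "a", "b", "c", "a", "b", "c" }, 2, 3)` yields `a b`, `b c`, and `a b c`, each with a count of 2. It is still a pass over every phrase of every length, so it trades memory for avoiding the repeated rescans rather than being fundamentally cleverer.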


mp3*_*ret 5

Thought I'd have a quick go at this. I'm not sure whether this is the brute-force approach you were trying to avoid, but:

static void Main(string[] args)
{
    string txt = @"Except oral wills, every will shall be in writing, 
but may be handwritten or typewritten. The will shall contain the testator's 
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    //split string using common separators - could add more or use regex.
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');

    //trim each string and get rid of any empty ones
    words = words.Select(t=>t.Trim()).Where(t=>t!=string.Empty).ToArray();

    const int MaxPhraseLength = 20;

    Dictionary<string, int> Counts = new Dictionary<string,int>();

    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        //only start where a full phrase of phraseLen words still fits
        for (int i = 0; i <= words.Length - phraseLen; i++)
        {
            //get the phrase to match based on phraselen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);

            Console.WriteLine("Phrase : {0}", sphrase);

            int index = FindPhraseIndex(words, i+phrase.Length, phrase);

            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);

                if(!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1);

                Counts[sphrase]++;
            }
        }
    }

    foreach (var foo in Counts)
    {
        Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
    }

    Console.ReadKey();
}

static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

static int  FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;

        for(j=0; j<matchWords.Length && (i+j)<words.Length; j++)
            if(matchWords[j]!=words[i+j])
                break;

        if (j == matchWords.Length)
            return i; //position of the match, not where the search started
    }

    return -1;
}
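Because the loop above tries every length from MaxPhraseLength down to 2, a repeated long phrase also shows up as each of its shorter sub-phrases. One possible way to tidy the output is to drop any phrase that sits inside a longer phrase with the same count. The sketch below assumes a dictionary shaped like `Counts` above; `MaximalPhraseSketch` and `KeepMaximal` are hypothetical names, and the filter is a suggestion rather than part of the original approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaximalPhraseSketch
{
    // Drops a phrase when a longer phrase with the same count contains it on word boundaries.
    public static Dictionary<string, int> KeepMaximal(Dictionary<string, int> counts)
    {
        return counts
            .Where(p => !counts.Any(q =>
                q.Key.Length > p.Key.Length &&
                q.Value == p.Value &&
                // Pad with spaces so only whole-word matches count as containment.
                (" " + q.Key + " ").Contains(" " + p.Key + " ")))
            .ToDictionary(p => p.Key, p => p.Value);
    }
}
```

Given counts of `a b` = 2, `b c` = 2, `a b c` = 2, and `c d` = 3, this keeps only `a b c` and `c d`. A shorter phrase with a higher count than the longer one survives, since it also occurs somewhere else on its own.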