Algorithm to analyze text of words

Cli*_*ote 1 php java string algorithm nlp

I want an algorithm that can create all possible phrases from a block of text. For example, given the text:

"My username is click upvote. I have 4k rep on stackoverflow"

it would create the following combinations:

"My username"
"My Username is"
"username is click"
"is click"
"is click upvote"
"click upvote"
"i have"
"i have 4k"
"have 4k"
..

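You get the idea. Basically, the point is to get all possible combinations of "phrases" out of a sentence. Any thoughts on how best to implement this?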

And*_*ffe 5

Well, I don't know PHP or Java, but basically you want a double loop over all the words in the text. Here's some pseudocode:

words = split(text)
n = len(words)
for i in 1...n-1 {        // i = first word in phrase 
    for j in i+1...n {       // j = last word in phrase
        phrase = join(words[i:j])
        print phrase
    }
}

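Note that the second loop starts from i+1, not from 1. This gives you every phrase that starts at word number i and ends at word number j, where j is greater than i (so every phrase contains at least two words).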
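For readers who want something runnable, here is a minimal Java sketch of the same double loop, assuming 0-based indices and plain whitespace splitting. The class name `PhraseLoopSketch` and the method `phrases` are made up for illustration; `String.join` and `List.subList` are standard library calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseLoopSketch {
    // Returns every run of two or more consecutive words in the sentence.
    // Assumption: words are separated by whitespace and nothing else.
    static List<String> phrases(String sentence) {
        List<String> words = Arrays.asList(sentence.trim().split("\\s+"));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size() - 1; i++) {        // i = index of first word
            for (int j = i + 1; j < words.size(); j++) {    // j = index of last word
                // subList's end index is exclusive, hence j + 1.
                result.add(String.join(" ", words.subList(i, j + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String phrase : phrases("My username is click upvote")) {
            System.out.println(phrase);
        }
    }
}
```

Because the loops are 0-based, `i` stops at the second-to-last word and `j` runs to the last one, which mirrors the 1-based bounds in the pseudocode.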

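Ah, I just realized that you probably don't want phrases to span sentence boundaries. So you need an outer loop that first splits the text into sentences and then runs the double loop on each sentence.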
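One possible way to get that outer loop in Java is the standard `java.text.BreakIterator`, which knows some locale-specific sentence rules. This is only a sketch: the class name `SentenceSplitSketch` and the method `sentences` are hypothetical, and the English locale is an assumption about the input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {
    // Splits a block of text into trimmed sentences.
    // Assumption: the text is English prose.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "My username is click upvote. I have 4k rep on stackoverflow.";
        for (String s : sentences(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Each returned sentence still carries its terminating punctuation, so it would need cleaning before being handed to the double loop.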

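This is probably clear if you have any programming experience, but just in case: the for statements are loops [like for(i=1; i<=n; i++)]; split is some function that takes a string and splits it into an array of words (this is not entirely trivial, but there is probably a library function for it); len gives the length of an array; join puts the words back together with spaces in between; and the syntax [i:j] means all elements from i to j inclusive (in Python, this would actually be [i:j+1]). Oh, and I've implicitly assumed that arrays start at index 1 rather than zero; I'll leave the change to 0-based C arrays as an exercise...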

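Finally, to answer the specific questions:

  • Note that the "second" loop is actually a nested loop; for each value of i (the first word of the phrase), we loop from i+1 to the end of the sentence to get the last word of the phrase.

  • Now that we have the numbers of the first and last words, join (a function you will have to write) concatenates the individual strings word[i], word[i+1], ... word[j] to form the phrase. In practice, that probably means the function can be declared as join(words, i, j) and returns a string, although some languages have ways to make this easier.

  • If you read his first sentence, you'll see that he doesn't know PHP or Java. Besides, the pseudocode given should be simple enough to convert to Java yourself, given some basic Java knowledge and a bit of searching. (4 upvotes)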
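As an illustration of the join(words, i, j) helper described in the comments above, here is a small Java sketch. It assumes 0-based indices with both ends inclusive; the class name `JoinSketch` is made up for the example.

```java
class JoinSketch {
    // Joins words[i] through words[j] (both inclusive, 0-based) with single spaces.
    static String join(String[] words, int i, int j) {
        StringBuilder sb = new StringBuilder(words[i]);
        for (int k = i + 1; k <= j; k++) {
            sb.append(' ').append(words[k]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"My", "username", "is", "click", "upvote"};
        System.out.println(join(words, 2, 4));
    }
}
```

In Java 8 and later, `String.join(" ", Arrays.copyOfRange(words, i, j + 1))` does the same job with library calls.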

pax*_*blo 5

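Basically, you first need to split the block of text into sentences. Even in English this is tricky, because you need to watch for periods, question marks, exclamation marks, and any other sentence terminators.

Then process one sentence at a time, after removing all punctuation (commas, semicolons, colons, and so on).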
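A rough Java sketch of the punctuation-removal step might look like the following. The set of characters being stripped is an assumption chosen for the example (apostrophes and hyphens are deliberately kept so that words like "don't" survive), and the names `TokenizeSketch` and `words` are hypothetical.

```java
import java.util.Arrays;

class TokenizeSketch {
    // Replaces a small, assumed set of punctuation marks with spaces,
    // then splits the sentence into words at runs of whitespace.
    static String[] words(String sentence) {
        String cleaned = sentence.replaceAll("[,;:.!?\"()]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Well, yes; I have 4k rep!")));
    }
}
```

Replacing punctuation with a space rather than deleting it keeps "yes,no" from collapsing into a single word.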

Then, once you're left with an array of words, it becomes simpler:

for i = 1 to num_words-1:
    for j = i+1 to num_words:
        phrase = words[i through j inclusive]
        store phrase

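And that's it, pretty simple (after the initial massaging of the text block, which may not be as simple as you think).

This gives you all phrases of two or more words in each sentence.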
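A quick sanity check on how many phrases to expect (an added aside, derived only from the loop bounds in the pseudocode): each phrase is fixed by a first word i and a last word j with i < j, so a sentence of n words yields

```latex
\[
  \sum_{i=1}^{n-1} (n - i) \;=\; \frac{n(n-1)}{2}
\]
```

phrases, which is 10 for a five-word sentence and 15 for a six-word one.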

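Separating the text into sentences, separating sentences into words, removing punctuation and so on will be the hardest part, but I've shown you some simple initial rules. The rest should be added whenever a block of text breaks the algorithm.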

Update:

As requested, here's some Java code that produces the phrases:

public class testme {
    public final static String text =
        "My username is click upvote." +
        " I have 4k rep on stackoverflow.";

    public static void procSentence (String sent) {
        System.out.println ("==========");
        System.out.println ("sentence [" + sent + "]");

        // Split sentence at whitespace into array.

        String [] sa = sent.split("\\s+");

        // Process each starting word.

        for (int i = 0; i < sa.length - 1; i++) {

            // Process each phrase.

            for (int j = i+1; j < sa.length; j++) {

                // Build the phrase.

                String phrase = sa[i];
                for (int k = i+1; k <= j; k++) {
                    phrase = phrase + " " + sa[k];
                }

                // This is where you have your phrase. I just
                // print it out but you can do whatever you
                // wish with it.
                System.out.println ("   " + phrase);
            }
        }
    }

    public static void main(String[] args) {
        // This is the block of text to process.

        String block = text;
        System.out.println ("block    [" + block + "]");

        // Keep going until no more sentences.

        while (!block.equals("")) {
            // Remove leading spaces.

            if (block.startsWith(" ")) {
                block = block.substring(1);
                continue;
            }

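            // Find end of sentence. If there is no further period,
            // treat the rest of the block as the final sentence
            // (otherwise substring would throw on pos == -1).

            int pos = block.indexOf('.');
            if (pos < 0) {
                pos = block.length();
            }

            // Extract sentence and remove it from text block.

            String sentence = block.substring(0,pos);
            block = block.substring(Math.min(pos+1, block.length()));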

            // Process the sentence (this is the "meat").

            procSentence (sentence);

            System.out.println ("block    [" + block + "]");
        }
        System.out.println ("==========");
    }
}

which outputs:

block    [My username is click upvote. I have 4k rep on stackoverflow.]
==========
sentence [My username is click upvote]
   My username
   My username is
   My username is click
   My username is click upvote
   username is
   username is click
   username is click upvote
   is click
   is click upvote
   click upvote
block    [ I have 4k rep on stackoverflow.]
==========
sentence [I have 4k rep on stackoverflow]
   I have
   I have 4k
   I have 4k rep
   I have 4k rep on
   I have 4k rep on stackoverflow
   have 4k
   have 4k rep
   have 4k rep on
   have 4k rep on stackoverflow
   4k rep
   4k rep on
   4k rep on stackoverflow
   rep on
   rep on stackoverflow
   on stackoverflow
block    []
==========

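Now, keep in mind this is very basic Java (some might say it's C written in a Java dialect :-). It's only meant to illustrate how to output word groupings from a sentence, as you asked.

It does none of the fancy sentence detection and punctuation removal I mentioned in the original answer.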
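One optional refinement, offered as a sketch rather than a correction: the innermost k loop in procSentence rebuilds every phrase from its first word. Since each phrase is just the previous one plus one more word, a `StringBuilder` can be extended in place instead. The class name `IncrementalPhraseSketch` and its `phrases` method are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalPhraseSketch {
    // Collects all phrases of two or more consecutive words,
    // growing each phrase by one word per inner iteration.
    static List<String> phrases(String[] words) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.length - 1; i++) {
            StringBuilder phrase = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length; j++) {
                phrase.append(' ').append(words[j]);
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = "I have 4k rep on stackoverflow".split("\\s+");
        for (String p : phrases(words)) {
            System.out.println("   " + p);
        }
    }
}
```

The phrases come out in the same order as in the original program; only the way each string is assembled changes.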