Paragraph breaks using Stanford CoreNLP

Jul*_*les 3 java nlp stanford-nlp

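Is there a way to extract paragraph information from Stanford CoreNLP? I am currently using it to extract sentences from a document, but I would also like to determine the document's paragraph structure, and ideally CoreNLP would do that for me. Paragraphs in my source documents are separated by double newlines. I looked through the CoreNLP javadoc and there appears to be a ParagraphAnnotation class, but the documentation does not seem to specify what it contains, and I cannot find an example of how to use it anywhere. Can anyone point me in the right direction?

For reference, my current code looks like this: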

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        convertedSentences.add (new Sentence (sentence));
    }

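The Sentence constructor extracts the words from the sentence. How can I extend this to get an extra level of data, i.e. so that my current document-wide "convertedSentences" list is supplemented by a "convertedParagraphs" list, in which each entry contains a "convertedSentences" list?

I tried the approach that seemed most obvious to me: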

    List<CoreMap> paragraphs = document.get(ParagraphsAnnotation.class);
    for (CoreMap paragraph : paragraphs)
    {
        List<CoreMap> sentences = paragraph.get(SentencesAnnotation.class);
        List<Sentence> convertedSentences = new ArrayList<> ();
        for (CoreMap sentence : sentences)
        {
            convertedSentences.add (new Sentence (sentence));
        }

        convertedParagraphs.add (new Paragraph (convertedSentences));
    }

But this doesn't work, so I guess I have misunderstood something about how this is supposed to work.

Jul*_*les 6

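It seems that the existence of a ParagraphsAnnotation class in CoreNLP is a red herring: nothing actually uses this class (see http://grepcode.com/search/usages?type=type&id=repo1.maven.org%24maven2@edu.stanford.nlp%24stanford-corenlp@3.2.0@edu%24stanford%24nlp%24ling@CoreAnnotations.ParagraphsAnnotation&k=u; there are quite literally no references to this class other than its definition). Therefore, I have to break up the paragraphs myself.

The key is to notice that each sentence contained in the SentencesAnnotation carries a CharacterOffsetBeginAnnotation. My code then becomes something like this: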

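    // Assumed context: "text" is the original document string that was annotated,
    // and "paragraphs" is a List<Paragraph> declared elsewhere.
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        int sentenceOffsetStart = sentence.get (CharacterOffsetBeginAnnotation.class);
        // A sentence directly preceded by a double newline starts a new paragraph,
        // so close the paragraph collected so far.
        if (sentenceOffsetStart > 1 && text.substring (sentenceOffsetStart - 2, sentenceOffsetStart).equals("\n\n") && !convertedSentences.isEmpty ())
        {
            Paragraph current = new Paragraph (convertedSentences);
            paragraphs.add (current);
            convertedSentences = new ArrayList<> ();
        }
        convertedSentences.add (new Sentence (sentence));
    }
    // Close the final paragraph, unless the document contained no sentences at all.
    if (!convertedSentences.isEmpty ())
    {
        Paragraph current = new Paragraph (convertedSentences);
        paragraphs.add (current);
    }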
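The check above only fires when the two characters immediately before a sentence are exactly "\n\n", so it would miss a break written as "\r\n\r\n" or one followed by indentation. A slightly more tolerant variant, sketched below in plain Java with no CoreNLP dependency, looks at the whole gap between the end of one sentence and the start of the next. The class name ParagraphGrouper, the Span type, and the group method are all made up for illustration; in real code the begin offset would come from CharacterOffsetBeginAnnotation and the end offset presumably from its counterpart CharacterOffsetEndAnnotation. It assumes the spans are sorted and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphGrouper {

    /** Hypothetical stand-in for a sentence: begin (inclusive) and end (exclusive) character offsets. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns true if the gap contains two line feeds separated only by whitespace. */
    public static boolean hasBlankLine(String gap) {
        int newlines = 0;
        for (int i = 0; i < gap.length(); i++) {
            char c = gap.charAt(i);
            if (c == '\n') {
                newlines++;
                if (newlines >= 2) {
                    return true;
                }
            } else if (!Character.isWhitespace(c)) {
                // Non-whitespace between line feeds means they do not form a blank line.
                newlines = 0;
            }
        }
        return false;
    }

    /** Groups sentence spans into paragraphs, starting a new one after each blank line. */
    public static List<List<Span>> group(String text, List<Span> sentences) {
        List<List<Span>> paragraphs = new ArrayList<>();
        List<Span> current = new ArrayList<>();
        int previousEnd = 0;
        for (Span sentence : sentences) {
            // Text between the previous sentence and this one (clamped in case spans touch or overlap).
            int gapStart = Math.min(previousEnd, sentence.begin);
            String gap = text.substring(gapStart, sentence.begin);
            if (!current.isEmpty() && hasBlankLine(gap)) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(sentence);
            previousEnd = sentence.end;
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }
}
```

Working on offsets rather than on CoreMap objects keeps the grouping logic testable on its own; each Span could just as well carry the Sentence it was built from.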