Formatting Stanford CoreNLP NER output

Question

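I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that every word is tagged with its own annotation. So if the entity is "NEW YORK TIMES", it gets recorded as three separate entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that we get the combined output as a single entity?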

In Stanford NER, the command-line utility lets us choose inlineXML as the output format. Can we somehow set a property to select the output format in Stanford CoreNLP as well?

Answer 1

Cha*_*ker 5

If you just want the complete string of each named entity found by Stanford NER, try this:

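import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreLabel> ner = CRFClassifier.getDefaultClassifier();  // default CRF model
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);  // (class, start, end)
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third));  // full entity string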

If you want to know the entity class, it is given by entity.first.
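To make the shape of those triples concrete, here is a minimal sketch that groups entity strings by their class. It does not call Stanford NER: `Span` is a hypothetical stand-in for `Triple<String, Integer, Integer>`, and the sample sentence and offsets are hand-written for illustration, not real classifier output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EntitiesByClass {
    // Hypothetical stand-in for Triple<String, Integer, Integer>:
    // first = entity class, second = start offset, third = end offset (exclusive).
    static final class Span {
        final String first;
        final int second;
        final int third;

        Span(String first, int second, int third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }
    }

    // Collect the text of each span under its entity class, keeping first-seen order.
    static Map<String, List<String>> groupByClass(String text, List<Span> spans) {
        Map<String, List<String>> byClass = new LinkedHashMap<>();
        for (Span s : spans) {
            byClass.computeIfAbsent(s.first, k -> new ArrayList<>())
                   .add(text.substring(s.second, s.third));
        }
        return byClass;
    }

    public static void main(String[] args) {
        String text = "The New York Times reported from Paris.";
        // Hand-written offsets for illustration only.
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("ORGANIZATION", 4, 18));
        spans.add(new Span("LOCATION", 33, 38));
        System.out.println(groupByClass(text, spans));
    }
}
```

With the real classifier, the same loop should work directly on the list returned by classifyToCharacterOffsets, reading entity.first, entity.second and entity.third from each triple.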

Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
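If you go the inline XML route and still need the entities as data, one option is to pull them back out with a regular expression. The sketch below uses only the JDK and runs on the sample string above. The class name `InlineXmlEntities` and the `extract` helper are made up for illustration, and it assumes tag names are uppercase letters or underscores, tags are not nested, and the text needs no XML unescaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineXmlEntities {
    // Matches <TAG>text</TAG>; the backreference \1 requires the closing tag
    // to repeat the opening tag name.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    // Returns {entity class, entity text} pairs in order of appearance.
    static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }

    public static void main(String[] args) {
        // Sample string in the inline XML format; not produced by a live classifier run.
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] e : extract(tagged)) {
            System.out.println(e[0] + "\t" + e[1]);
        }
    }
}
```

The character-offset approach is probably the more robust of the two, since it avoids re-parsing text; the regex is mainly handy when the tagged string is all you have.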
