S G*_*ber 26 java parsing nlp artificial-intelligence stanford-nlp
How can I split a text or a paragraph into sentences using the Stanford parser?
Is there any method that can extract sentences, such as the getSentencesFromString() that is provided for Ruby?
Ken*_*hoi 30
You can check out the DocumentPreprocessor class. Below is a short snippet. I think there may be other ways to do what you want.
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
    // SentenceUtils not Sentence
    String sentenceString = SentenceUtils.listToString(sentence);
    sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
    System.out.println(sentence);
}
Kev*_*vin 24
I know there's already an accepted answer... but normally you'd just grab the SentenceAnnotations from an annotated document.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
}
Source: http://nlp.stanford.edu/software/corenlp.shtml (about halfway down the page)
If you're only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization, which will save you some load and processing time. Rock and roll. ~K
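For example, a minimal sketch of that slimmed-down pipeline might look like the following (the class name and sample text are placeholders of my own; the annotator names and annotation keys are the ones used in the snippet above):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class SentenceSplitOnly {
    public static void main(String[] args) {
        // Only the tokenizer and the sentence splitter are loaded, so the pipeline
        // starts up much faster than one that also runs pos/lemma/ner/parse/dcoref.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("My 1st sentence. “Does it work for questions?” My third sentence.");
        pipeline.annotate(document);

        // Each sentence CoreMap carries the text of its span in TextAnnotation.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            System.out.println(sentence.get(TextAnnotation.class));
        }
    }
}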
dan*_*ton 16
The accepted answer has a couple of problems. First, the tokenizer transforms some characters, for example turning the character “ into the two characters ``. Second, joining the tokenized text back together with whitespace does not give you back the same result as before. As a result, the example from the accepted answer transforms the input text in non-trivial ways.
However, the CoreLabel class that the tokenizer uses keeps track of the source characters each token maps to, so rebuilding the correct string is trivial if you have the original.
Approach 1 below shows the accepted answer's approach; Approach 2 shows my approach, which overcomes these problems.
String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}

//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);

//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence : sentences) {
    end = sentence.get(sentence.size() - 1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));
This outputs:
My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
Using the .NET C# package: this will split sentences, get the brackets right, and preserve the original spaces and punctuation:
public class NlpDemo
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
            "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}
Input: - This sentence's characters possess a certain charm, one often found in punctuation and prose. Is this a second sentence? Indeed it is.
Output: 3 sentences ('?' is considered an end-of-sentence delimiter)
Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all respects." the tokenizer will correctly recognize that the period after Mrs. is not an EOS, but it will incorrectly mark the ! inside the parentheses as an EOS and split "in all respects." off as a second sentence.
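For readers who want the same behavior from Java rather than C#, here is a rough sketch of the equivalent code. The tokenizer option string and the joinWithOriginalWhiteSpace call are taken from the C# snippet above; the class name and the unchecked cast are my own additions, so treat this as a sketch rather than a verified recipe.

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.util.StringUtils;

public class InvertibleSplitDemo {
    public static void main(String[] args) {
        String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

        DocumentPreprocessor preProcessor = new DocumentPreprocessor(new StringReader(paragraph));
        // Same tokenizer options as the C# answer: leave brackets alone and keep the
        // original text and whitespace of every token (invertible=true).
        preProcessor.setTokenizerFactory(PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true"));

        for (List<HasWord> sentence : preProcessor) {
            // With a CoreLabelTokenFactory the tokens really are CoreLabels; the cast
            // is only unchecked as far as the compiler is concerned.
            @SuppressWarnings("unchecked")
            List<CoreLabel> tokens = (List<CoreLabel>) (List<?>) sentence;
            System.out.println(StringUtils.joinWithOriginalWhiteSpace(tokens));
        }
    }
}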