Text tokenization with Stanford NLP: filtering out unwanted words and characters

Question


dmi*_*ony 6 java machine-learning tokenize stanford-nlp

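I use Stanford NLP for string tokenization in my classification tool. I want to keep only meaningful words, but I get non-word tokens (such as ---, >, . and so on) as well as unimportant words like am, is, to (stop words). Does anyone know a way to solve this problem?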

Answer 1

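For Stanford CoreNLP there is a stopword-removal annotator that removes the standard stop words. It is registered as a custom annotator class rather than being one of the built-in annotators. You can also define your own custom stop words as needed (for example ---, <, . and so on).

Here is an example: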

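   import java.util.List;
   import java.util.Properties;
   import edu.stanford.nlp.ling.CoreAnnotations;
   import edu.stanford.nlp.ling.CoreLabel;
   import edu.stanford.nlp.pipeline.Annotation;
   import edu.stanford.nlp.pipeline.StanfordCoreNLP;

   // "stopword" is not a built-in annotator name; it is bound to the custom class below.
   Properties props = new Properties();
   props.setProperty("annotators", "tokenize, ssplit, stopword");
   props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");

   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(example); // example: the input text (a String)
   pipeline.annotate(document);
   List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);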


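In the example above, "tokenize, ssplit, stopword" is the list of annotators the pipeline runs. The stopword entry refers to the custom annotator class registered through the customAnnotatorClass.stopword property; the stop words themselves are not set on that line.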

Hope this helps!

How do I download the jar file for intoxicant.analytics.coreNlp.StopwordAnnotator? (2 upvotes)

Answer 2

Jon*_*ier 5

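This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular-expression filter and a stopword filter applied on top of the CoreNLP tokenizer.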
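A minimal sketch of that idea, using only the JDK so it runs without CoreNLP on the classpath. It assumes the token strings have already been extracted from the tokenizer output (for instance by calling word() on each CoreLabel). The class name TokenFilter, the word pattern, and the tiny stopword set are all hypothetical choices for illustration, not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class TokenFilter {
    // Hypothetical, deliberately tiny stopword set; substitute a fuller list.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "am", "an", "and", "are", "i", "is", "of", "the", "to"));

    // A token counts as a word if it starts with a letter and otherwise
    // contains only letters, apostrophes, or hyphens. Adjust for your domain.
    static final Pattern WORD = Pattern.compile("\\p{L}[\\p{L}'-]*");

    // Returns the tokens that look like words and are not stop words.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean looksLikeWord = WORD.matcher(token).matches();
            boolean isStopword = STOPWORDS.contains(token.toLowerCase(Locale.ROOT));
            if (looksLikeWord && !isStopword) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "I", "am", "going", "to", "---", ">", "school", ".");
        System.out.println(filter(tokens)); // prints [going, school]
    }
}
```

The regex check drops punctuation-only tokens such as --- and >, and the set lookup drops stop words case-insensitively. Whether numbers, hyphenated terms, or other token shapes should survive depends on the classification task, so the pattern is the part most likely to need changing.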

Here is an example list of English stop words.
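Stopword lists are often distributed as plain text with one word per line. As a rough sketch under that assumption (the format of the linked list is not shown here), such text could be turned into a lookup set as follows. The class name StopwordLoader and the convention of skipping lines that start with # are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopwordLoader {
    // Parses text containing one stopword per line into a lowercase set.
    // Blank lines and lines starting with '#' are skipped (assumed convention).
    static Set<String> parse(String text) {
        Set<String> stopwords = new HashSet<>();
        for (String line : text.split("\\R")) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    public static void main(String[] args) {
        String sample = "# sample list\nam\nis\n\nTo\n";
        Set<String> stopwords = parse(sample);
        System.out.println(stopwords.contains("to")); // prints true
        System.out.println(stopwords.size());         // prints 3
    }
}
```

Lowercasing at load time means a token only needs to be lowercased once before the lookup.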

Archived:	10 years, 8 months ago
Views:	6240
Last activity:	7 years, 3 months ago