Splitting a Chinese document into sentences

Question

pje*_*has 5 nlp tokenize stanford-nlp sentence

I need to split Chinese text into sentences. I tried the Stanford DocumentPreProcessor. It works for English, but not for Chinese.

Could you recommend a good sentence splitter for Chinese, preferably in Java or Python?

Answer 1

alv*_*vas 6

Using some regex tricks in Python (see the modified regex in Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):

    import re

    # Two sample sentences, each ending with the Chinese full stop (U+3002).
    paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

    def zng(paragraph):
        # A sentence is a run of non-terminator characters followed by at most
        # one terminator. The scraped pattern listed ASCII "!?" twice; the first
        # pair is assumed to have been full-width (U+FF01, U+FF1F) originally.
        pattern = u'[^\uff01\uff1f\u3002.!?]+[\uff01\uff1f\u3002.!?]?'
        for sent in re.findall(pattern, paragraph, flags=re.U):
            yield sent

    list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2
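
To make the behaviour of the pattern easier to see, here is a minimal sketch that applies the same idea to a short string with mixed punctuation. The names `TERMINATORS`, `SENTENCE_RE` and `split_sentences`, and the sample text, are made up for illustration. Including the full-width ！ and ？ in the terminator set is an assumption about what Chinese text needs, not something confirmed by the answer.

```python
import re

# Sentence terminators: Chinese full stop, full-width ! and ?, plus ASCII . ! ?
# (the full-width marks are an assumption added for this sketch).
TERMINATORS = u'\u3002\uff01\uff1f.!?'

# One sentence = a run of non-terminators followed by at most one terminator.
SENTENCE_RE = re.compile(u'[^{0}]+[{0}]?'.format(TERMINATORS))

def split_sentences(text):
    """Return the sentences in `text`, each with its terminator attached."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]

sample = u'你好吗？我很好。谢谢'
print(split_sentences(sample))  # ['你好吗？', '我很好。', '谢谢']
```

Two limitations follow from the pattern itself: in a run of terminators such as "？！" only the first mark is kept, and an ASCII dot inside a number such as "3.14" is treated as a sentence end.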

Answer 2

Joh*_*ohn 2

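For unsegmented text with the Stanford libraries, you will probably want to use their Chinese CoreNLP. It is not as well documented as the basic CoreNLP, but it will do the job for your task.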

http://nlp.stanford.edu/software/corenlp-faq.shtml#languages http://nlp.stanford.edu/software/corenlp.shtml

You will need the segmenter and the sentence splitter: "segment, ssplit". The other annotators are not relevant.

Alternatively, you can use the WordToSentenceSplitter class in edu.stanford.nlp.process.WordToSentenceSplitter directly. If you do, look at how it is used in WordsToSentencesAnnotator.
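
For readers who want to drive the segmenter plus sentence splitter from Python, the following is a rough sketch that talks to a CoreNLP server over HTTP. It rests on several assumptions: a CoreNLP server with the Chinese models is already running at `http://localhost:9000`, the server accepts the `pipelineLanguage=zh` query parameter, and the annotator list is `tokenize,ssplit` (older releases may expect `segment,ssplit`, as named in this answer). All function names are hypothetical, and the offset handling assumes text without characters outside the Basic Multilingual Plane.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a CoreNLP server with Chinese models listens here.
SERVER_URL = 'http://localhost:9000'

def build_request_url(server_url=SERVER_URL, annotators='tokenize,ssplit'):
    """Build a server URL that asks only for segmentation and sentence splitting."""
    props = {'annotators': annotators, 'outputFormat': 'json'}
    query = urllib.parse.urlencode({
        'properties': json.dumps(props),
        'pipelineLanguage': 'zh',  # assumed to select the Chinese defaults
    })
    return server_url + '/?' + query

def sentences_from_json(doc, text):
    """Recover sentence strings from the token character offsets in the JSON output."""
    sentences = []
    for sent in doc['sentences']:
        tokens = sent['tokens']
        start = tokens[0]['characterOffsetBegin']
        end = tokens[-1]['characterOffsetEnd']
        sentences.append(text[start:end])
    return sentences

def split_with_corenlp(text, server_url=SERVER_URL):
    """POST `text` to the server and return its sentences (needs a live server)."""
    request = urllib.request.Request(
        build_request_url(server_url),
        data=text.encode('utf-8'),
        headers={'Content-Type': 'text/plain; charset=utf-8'},
    )
    with urllib.request.urlopen(request) as response:
        doc = json.loads(response.read().decode('utf-8'))
    return sentences_from_json(doc, text)
```

With a server running, `split_with_corenlp(text)` would return one string per sentence. Slicing the original text by offsets, rather than joining token strings, keeps the sentences identical to the input.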

Archived:	11 years ago
Views:	2453
Last recorded:	8 years, 4 months ago