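I have to cut a unicode string that is actually an article (it contains sentences). I want to cut this article string after the X-th sentence in Python.
A good indicator that a sentence has ended is that it ends with a full stop (".") and the word after it starts with a capital letter. For example: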
myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."
How can this be achieved?
Thanks
Tim*_*ara 15
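Consider downloading the Natural Language Toolkit (NLTK). Then you can split the text into sentences in a way that does not break on things like "U.S.A." and does not fail to split sentences that end in "?!".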
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> sentences
[u"Hi, this is my first sentence.", u"And this is my second.", u"Yet this is my third."]
Your code becomes much more readable. To access the second sentence, use the indexing notation you are used to.
>>> sentences[1]
u"And this is my second."
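To actually cut the article after the X-th sentence, slice the list of sentences and join it back together. With NLTK that would be along the lines of `" ".join(sentences[:x])`. If installing NLTK is not an option, here is a minimal sketch that uses only the heuristic from the question (a full stop, then whitespace, then a capital letter). The name `cut_after_sentence` is made up for illustration, and the regex will wrongly split on abbreviations such as "Dr. Smith", so treat it as a rough fallback rather than a replacement for a real tokenizer.

```python
import re

# Heuristic from the question: a sentence boundary is whitespace that
# follows a full stop and precedes an uppercase ASCII letter.
# Assumption: sentences ending in "?" or "!" are not handled here.
_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def cut_after_sentence(article, x):
    """Return the first x sentences of article, joined by single spaces."""
    sentences = _BOUNDARY.split(article)
    # Slicing past the end is safe: it simply returns every sentence.
    return " ".join(sentences[:x])


myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."
print(cut_after_sentence(myarticle, 2))
# prints: Hi, this is my first sentence. And this is my second.
```

Because the sentences are rejoined with a single space, any other whitespace between sentences (newlines, double spaces) is normalized in the result.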