use*_*149 22 python regex nlp tokenize
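I want to create a list of sentences from a string and then print them out. I don't want to use NLTK for this. So it needs to split on a period at the end of a sentence, but not on decimals, abbreviations, or titles of names, or when the sentence contains a .com. This is my attempt at a regex, which doesn't work.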
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example of the desired output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
vks*_*vks 31
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this. Split your string on it. You can also check the demo.
http://regex101.com/r/nG1gU7/27
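A minimal sketch of how the pattern above might be used with re.split, assuming Python 3 and the sample text from the question. The names SENTENCE_BOUNDARY and split_sentences are illustrative and not part of the answer.

```python
import re

# Pattern from the answer above: split on whitespace that follows '.' or '?',
# unless the preceding text looks like "x.y." (e.g. "i.e.") or a title such as "Mr."
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_sentences(text):
    """Split text at the boundaries matched by SENTENCE_BOUNDARY."""
    # strip() avoids a trailing empty string when the text ends with a newline
    return SENTENCE_BOUNDARY.split(text.strip())

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid "
        "a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any "
        "case, this isn't true... Well, with a probability of .9 it isn't.")

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

On the sample text this yields the five sentences the question asks for. The pattern does not treat '!' as a terminator, and longer abbreviations such as "Mrs." would still be split, so it should be checked against your own data.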
smc*_*mci 27
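Ok, so sentence tokenizers are something I looked at in a little detail, using regexes, nltk, and CoreNLP. You end up writing your own, and it depends on the application. This stuff is tricky and valuable, and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and it also depends very heavily on your corpus or domain, e.g. social-media posts vs Yelp reviews vs ...)
In general you can't rely on one single Great White infallible regex. You have to write a function which uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write a functional spec for a deterministic tokenizer, just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence or something else: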
function isEndOfSentence(leftContext, rightContext)
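In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return a boolean, but in the more general sense it is probabilistic: it returns a float in 0.0-1.0 (the confidence that that particular '.' is a sentence end).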
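As a rough illustration of that signature, here is a hedged Python sketch. Everything in it is an assumption made for the example: the snake_case name is_end_of_sentence, the tiny ABBREVIATIONS set, the convention that left_context ends with the period in question, and the confidence values, which are arbitrary placeholders rather than measured probabilities.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one depends on the corpus.
ABBREVIATIONS = {"mr", "mrs", "dr", "jr", "sr", "i.e", "e.g", "etc"}

def is_end_of_sentence(left_context, right_context):
    """Return a confidence in [0.0, 1.0] that the period ending
    left_context is a sentence boundary."""
    # No whitespace right after the period (as in "1.5", "cheapsite.com",
    # or the middle of "..."): treat it as not a boundary.
    if right_context and not right_context[0].isspace():
        return 0.0
    # Last whitespace-delimited token before the period, e.g. "Jr" or "i.e".
    tokens = left_context.rstrip(".").split()
    last_token = tokens[-1].lower() if tokens else ""
    if last_token in ABBREVIATIONS:
        return 0.1  # placeholder: known abbreviations rarely end a sentence
    following = right_context.lstrip()
    if not following:
        return 1.0  # nothing follows, so the text ends here
    # A capital letter after the gap is evidence for a boundary,
    # anything else is evidence against it (placeholder scores).
    return 0.9 if following[0].isupper() else 0.2

examples = [
    ("Mr.", " Smith bought"),
    ("bought cheapsite.", "com for 1.5"),
    ("paid a lot for it.", " Did he mind?"),
]
for left, right in examples:
    print(repr(left), repr(right), is_end_of_sentence(left, right))
```

A caller would compare the returned value against a threshold of its choosing. The sketch ignores '?', '!', and closing quotes, which a real tokenizer would also need to handle.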
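References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version that used to be on YouTube has been taken down]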
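Try splitting the input on the spaces rather than on the dot or ?. If you split on the dot or ? itself, those characters will not appear in the final result.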
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.