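I have a text file and I need a list of its sentences.
How can this be implemented? There are many subtleties, such as periods used in abbreviations.
My old regular expression works badly: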
re.compile(r'(\. |^|!|\?)([A-Z][^;?\.<>@\^&/\[\]]*(\.|!|\?) )', re.M)
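A minimal sketch of one way to approach this, not a complete solution: treat sentence-ending punctuation followed by whitespace as a candidate boundary, then reject candidates whose preceding word is a known abbreviation. The names `ABBREVIATIONS`, `BOUNDARY` and `split_sentences`, and the abbreviation list itself, are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small abbreviation list; extend it for real text.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "jr", "sr", "st", "vs",
                 "etc", "i.e", "e.g"}

# Candidate boundary: sentence-ending punctuation, an optional closing quote
# or bracket, then whitespace. Requiring whitespace keeps "1.5" and
# "cheapsite.com" intact, because their dots are followed by a character.
BOUNDARY = re.compile(r'[.?!]+[\'")\]]*\s+')


def split_sentences(text):
    """Split text into sentences, skipping periods after known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Word immediately before the punctuation, e.g. "Mr" or "i.e".
        words = text[start:match.start()].rsplit(None, 1)
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences(
        "Mr. Smith paid 1.5 million dollars for cheapsite.com. Did he mind?"):
    print(sentence)
```

A known limitation of this sketch: a sentence that really ends in a listed abbreviation (for example "... Adam Jones Jr." followed by a new sentence) is merged with the next one, so the list is a trade-off rather than a fix.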
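
I want to build a list of sentences from a string and then print them out. I don't want to use NLTK for this. So it needs to split on the period at the end of a sentence, but not on decimals, abbreviations, or titles in names, nor when the sentence contains .com. Here is my regex attempt, which doesn't work: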
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example of the desired output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he …

How do I write a regular expression in Python that splits paragraphs?
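A paragraph is delimited by 2 newline characters (\n). However, any number of spaces/tabs may appear together with the newlines, and it should still count as a paragraph break.
I am using Python, so the solution may use Python's extended regular expression syntax (it can make use of (?P...) constructs).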
the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']
the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']
the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']
The best I could come up with is r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.
import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)
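But that is ugly. Is there anything better?
Edit:
r'\s*?\n\s*?\n\s*?' -> this makes examples 2 and 3 fail, because \s includes \n, so it allows paragraph breaks with more than 2 \n characters.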
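One possible shorter spelling, offered as a sketch rather than a definitive answer: the class `[^\S\n]` means "any whitespace character except a newline", so the same break can be written as two repetitions of "optional blanks, then a newline". The names `PARAGRAPH_BREAK` and `split_paragraphs` are hypothetical. Note that `[^\S\n]` is slightly broader than `[ \t\r\f\v]`, since for `str` patterns it also covers other Unicode whitespace.

```python
import re

# A paragraph break: exactly two newlines, each optionally preceded by
# non-newline whitespace, followed by optional non-newline whitespace.
PARAGRAPH_BREAK = re.compile(r'(?:[^\S\n]*\n){2}[^\S\n]*')


def split_paragraphs(text):
    """Split text on paragraph breaks as described in the question."""
    return PARAGRAPH_BREAK.split(text)


print(split_paragraphs('paragraph1\n\nparagraph2'))
print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'))
print(split_paragraphs('p1\n\n\n\tp2'))
```

Because a newline can never be consumed by `[^\S\n]`, a third consecutive newline stays at the start of the next paragraph, which is the behaviour the third example asks for.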

Let's assume I have the following paragraph:
"This is the first sentence. This is the second sentence? This is the third
sentence!"
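I need to create a function that returns only as many whole sentences as fit within a given number of characters. If the limit is shorter than the first sentence, it should return the first sentence's characters up to that limit.
For example: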
>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
sentence!"
>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"
>>> reduce_paragraph(50)
"This is the first sentence."
>>> reduce_paragraph(5)
"This "
I started with something like this, but I can't figure out how to finish it:
import itertools

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number, (truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence …
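
A minimal sketch of one way the function could be completed, using a regular expression instead of `itertools.groupby`. This is an assumption-laden illustration, not the original author's code: here `reduce_paragraph` takes the text as an explicit first argument (the calls in the question pass only the limit), and `PARAGRAPH` is a hypothetical constant holding the sample paragraph.

```python
import re

# Hypothetical constant holding the sample paragraph from the question.
PARAGRAPH = ("This is the first sentence. This is the second sentence? "
             "This is the third\nsentence!")


def reduce_paragraph(text, limit):
    """Return the longest run of whole sentences that fits in `limit` characters.

    Falls back to the first `limit` characters if not even one sentence fits.
    """
    best_end = 0
    # Each match ends right after a run of sentence-ending punctuation.
    for match in re.finditer(r'[.?!]+', text):
        if match.end() > limit:
            break
        best_end = match.end()
    return text[:best_end] if best_end else text[:limit]


print(reduce_paragraph(PARAGRAPH, 80))
print(reduce_paragraph(PARAGRAPH, 5))
```

The sketch treats every `.`, `?` or `!` as a sentence end, so abbreviations and decimals would be cut in the wrong place; it only aims to reproduce the four example calls.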