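Python: split text into sentences

I have a text file. I need to get a list of sentences.

How can this be implemented? There are many subtleties, such as dots being used in abbreviations.

My old regular expression works badly: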

re.compile(r'(\. |^|!|\?)([A-Z][^;?\.<>@\^&/\[\]]*(\.|!|\?) )', re.M)
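
One way to cope with abbreviations is to split at every candidate boundary first and then glue back together the pieces that end in a known abbreviation. The following is only a minimal sketch of that idea: the `ABBREVIATIONS` set and the function name `split_with_abbreviations` are made up for illustration, and a real text would need a much longer list.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, with the dot).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "e.g.", "i.e."}

def split_with_abbreviations(text):
    """Split on '.', '!' or '?' followed by whitespace, then re-join
    pieces whose last word is a known abbreviation."""
    # Candidate pieces; empty strings are dropped.
    parts = [p for p in re.split(r'(?<=[.!?])\s+', text.strip()) if p]
    sentences = []
    buffer = ""
    for part in parts:
        # Pieces are re-joined with a single space, so the original
        # whitespace between them is not preserved.
        buffer = f"{buffer} {part}" if buffer else part
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence end, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "Dr. Brown met Mr. Green. They talked, e.g. about Python. Was it fun? Yes!"
for sentence in split_with_abbreviations(sample):
    print(sentence)
```

A known weakness of this sketch is that a sentence which really does end with a listed abbreviation gets merged with the sentence that follows it.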

python text split

Art*_*yom

2011 01-02

85
votes

9
answers

110k
views

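Python - RegEx for splitting text into sentences (sentence tokenizing)

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK for this. So it needs to split on a period at the end of a sentence, but not at decimals, abbreviations, or titles of names, or when the sentence contains a .com. Here is my attempt at a regex, which doesn't work.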

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
    print(stuff)

Example of the desired output:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. 
Did he mind?
Adam Jones Jr. thinks he …
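
A regex-only approach that avoids NLTK can be sketched with fixed-width negative lookbehinds. This is a heuristic under stated assumptions, not a complete tokenizer: it treats whitespace after `.`, `?` or `!` as a boundary unless the preceding characters look like `i.e.` or a two-letter title such as `Mr.` or `Jr.`. The names `SENTENCE_BOUNDARY` and `split_into_sentences` are illustrative.

```python
import re

SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)'       # not right after patterns like "i.e." or "e.g."
    r'(?<![A-Z][a-z]\.)'  # not right after two-letter titles like "Mr." or "Jr."
    r'(?<=[.?!])'         # must follow sentence-ending punctuation
    r'\s+'                # the whitespace that is consumed by the split
)

def split_into_sentences(text):
    # strip() avoids an empty trailing item caused by a final newline
    return SENTENCE_BOUNDARY.split(text.strip())

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""

for sentence in split_into_sentences(text):
    print(sentence)
```

The lookbehinds are purely pattern based, so longer titles such as `Mrs.` still cause a split, and a sentence that ends in a capitalized two-letter word is not split at all.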

python regex nlp tokenize

use*_*149

2014 09-09

22
votes

3
answers

40k
views

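Python regular expression to split paragraphs

How would one write a regular expression, for use in Python, that splits a text into paragraphs?

A paragraph break is defined by 2 line-feed characters (\n). However, any number of spaces/tabs may appear together with the line feeds, and it should still be treated as a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (it may make use of (?P...) constructs).

Examples: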

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

But that is ugly. Is there anything better?

Edit:

Suggestions rejected:

r'\s*?\n\s*?\n\s*?' -> This would make examples 2 and 3 fail, because \s includes \n, so it would allow paragraph breaks with more than 2 \n characters.
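
One possible way to shorten the pattern is the double-negative class `[^\S\n]`, which means "whitespace, but not a line feed". The sketch below builds the pattern from that fragment and checks it against the three examples above; the names `H`, `PARAGRAPH_BREAK` and `split_paragraphs` are illustrative.

```python
import re

# "Horizontal" whitespace: anything \s matches except the line feed.
# For str patterns in Python 3, \s also covers Unicode whitespace, so this
# class is slightly broader than the hand-written [ \t\r\f\v].
H = r'[^\S\n]*'

# Exactly two line feeds, each optionally surrounded by horizontal whitespace.
PARAGRAPH_BREAK = re.compile(rf'{H}\n{H}\n{H}')

def split_paragraphs(text):
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('paragraph1\n\nparagraph2'))
print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
print(split_paragraphs('p1\n\n\n\tp2'))
```

Because the class excludes `\n`, a third consecutive line feed is never swallowed by the separator, which is the behavior the third example asks for.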

python regex parsing text split

nos*_*klo

2017 01-18

5
votes

1
answer

8162
views

Function to return sentences under a given number of characters

Let's suppose I have the following paragraph:

"This is the first sentence. This is the second sentence? This is the third
 sentence!"

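I need to create a function that returns only as many whole sentences as fit under a given number of characters. If the limit is shorter than the first sentence, it returns the characters of the first sentence up to that limit.

For example: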

>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
 sentence!"

>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"

>>> reduce_paragraph(50)
"This is the first sentence."

>>> reduce_paragraph(5)
"This "

I started with something like this, but I can't seem to figure out how to finish it:

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence …
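
The requirement can also be sketched without `itertools.groupby`, by scanning for sentence-ending punctuation and remembering the last end position that still fits. This is a minimal sketch under a few assumptions: the paragraph is passed in as an explicit `text` argument (the examples in the question call `reduce_paragraph` with the limit only), the sample paragraph is written on one line, and any run of `.`, `?` or `!` counts as a sentence end, so decimals and abbreviations are not handled.

```python
import re

def reduce_paragraph(text, limit):
    """Return the longest prefix of whole sentences that fits in `limit`
    characters, or the first `limit` characters if no sentence fits."""
    best_end = 0
    for match in re.finditer(r'[.?!]+', text):
        if match.end() > limit:
            break  # this sentence would exceed the limit
        best_end = match.end()
    if best_end == 0:
        # Not even the first sentence fits: fall back to a plain cut.
        return text[:limit]
    return text[:best_end]

paragraph = ("This is the first sentence. This is the second sentence? "
             "This is the third sentence!")

for limit in (100, 80, 50, 5):
    print(repr(reduce_paragraph(paragraph, limit)))
```

Slicing the original string keeps the spacing between sentences exactly as it was in the input.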

python

Dav*_*542

lucky-day

2
votes

1
answer

1125
views