Python: Extract sentences from a line - regex needed based on criteria

Dar*_*nes 5 python regex

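Bit of a Python/programming newbie here...

I'm trying to come up with a regex that can extract sentences from a line in a text file and then append them to a list. The code: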

import re

txt_list = []

with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()

    for line in read_txt:
        if line == "\n":
            txt_list.append("\n")
        else: 
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)


for line in txt_list:
    if line == "\n":
        print("newline")
    else:
        print(line)

Output printed by the last 5 lines of the code above:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

Contents of 'sample.txt':

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

I am the {very last|last} sentence for this {instance|example}.

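I've been playing with the regex for a couple of hours now and I can't seem to crack it. As it stands, the regex does not end a match at for lunch?. As a result, the two sentences What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said. are not separated, and separating them is what I want.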
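One way to see why the split is missed is to run the pattern on the first line of sample.txt alone. The sketch below assumes Python 3 and copies the pattern and the line from the question; the variable names `line` and `found` are only illustrative. Because `.*` is greedy, `.*}` first runs to the end of the line and then backs off to the last `}` that lets the rest of the pattern match, so everything from What up to the final period comes back as a single match.

```python
import re

# Pattern copied unchanged from the question.
patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'

# First line of sample.txt, including its trailing newline.
line = ("{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
        "What {will|shall|should} we {eat|have} for lunch? "
        "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
        "{that|is} what he said.\n")

found = re.findall(patt, line)

# repr() makes trailing spaces and newlines visible.
for piece in found:
    print(repr(piece))

# Only two pieces come back: the first alternative stops at "fellow}!" because
# that is the only "}" directly followed by punctuation; the second alternative
# then anchors on the LAST "}" of the remaining text, so the "lunch?" and
# "Peas ..." sentences stay glued together.
print(len(found))
```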

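Some important details for the regex:

  • Every sentence always ends with a period, an exclamation mark, or a question mark.
  • Every sentence always contains at least one pair of curly braces "{}" with some words inside. Also, there will be no misleading "." after the last brace in each sentence, so Dr. will always come before the last pair of curly braces in each sentence. That is why I tried to build my regex around '}': that way I can avoid maintaining a list of exceptions for abbreviations such as Dr., Jr., approx. and so on. For every file I run this code on, I personally make sure that there is no "misleading period" after the last '}' in any sentence.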

The output I want is this:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

小智 2

The most intuitive solution I came up with is this. Essentially, you need to treat the Dr. and Mr. tokens themselves as atoms.

patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

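Broken down, it says:

Find the smallest number of Mr.s, Dr.s, or any other characters up to a punctuation mark (., ! or ?), followed by zero or one whitespace character, followed by zero or one newline.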
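The breakdown above can be checked on a single line. This is a minimal sketch assuming Python 3; the helper name `split_sentences` and the `.strip()` call are additions for illustration and are not part of the pattern itself. The alternation order matters: `Dr\.` and `Mr\.` are tried before the catch-all `.`, so the period inside an abbreviation is consumed as part of the sentence body and never reaches `[.!?]`.

```python
import re

# Pattern from the answer; "Dr." and "Mr." are tried before the catch-all ".".
SENTENCE_PATT = re.compile(r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?')


def split_sentences(line):
    """Return the sentences found in one line, with trailing whitespace removed.

    Hypothetical helper: the original pattern keeps the trailing space or
    newline in each match; stripping is only done here for cleaner output.
    """
    return [match.strip() for match in SENTENCE_PATT.findall(line)]


# First line of sample.txt, including its trailing newline.
line = ("{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
        "What {will|shall|should} we {eat|have} for lunch? "
        "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
        "{that|is} what he said.\n")

sentences = split_sentences(line)
for sentence in sentences:
    print(sentence)
```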

When used on this sample.txt (I added a line):

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

But there are no {misters|doctors} here good sir! Help us if there is an emergency.

I am the {very last|last} sentence for this {instance|example}.

It gives:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.

newline
I am the {very last|last} sentence for this {instance|example}.
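The question also mentions abbreviations such as Jr. and approx., which the pattern above does not cover. One possible extension, sketched here under assumptions, is to build the alternation from a list. The function name `build_sentence_pattern`, the abbreviation list, and the sample text are all hypothetical; `re.escape` is used so that the period in each abbreviation is matched literally. A non-empty list is assumed.

```python
import re


def build_sentence_pattern(abbreviations):
    """Build a sentence pattern that treats each abbreviation as one unit.

    Hypothetical helper: same shape as the answer's pattern, with the
    "Dr\\.|Mr\\." part generated from a non-empty list of abbreviations.
    """
    # Longest first, so a longer abbreviation is tried before a shorter one.
    ordered = sorted(abbreviations, key=len, reverse=True)
    protected = "|".join(re.escape(abbr) for abbr in ordered)
    return re.compile(r'(?:' + protected + r'|.)*?[.!?]\s?\n?')


patt = build_sentence_pattern(["Dr.", "Mr.", "Jr.", "approx."])

# Made-up sample text in the same style as sample.txt.
text = ("Peas cost approx. {two|2} dollars said Jr. {clerk|cashier}! "
        "Is that {true|so}?")

sentences = [match.strip() for match in patt.findall(text)]
for sentence in sentences:
    print(sentence)
```

A limitation worth keeping in mind: a sentence that genuinely ends with one of the listed abbreviations is joined to the sentence after it, because the abbreviation's period is never treated as a terminator while a later one is available.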