Identifying sentences in a text

hen*_*nry 3 python regex string python-3.x

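I'm having some trouble correctly identifying sentences in a text for a few specific edge cases:

  1. When an ellipsis (...) is involved, the dots are not kept.
  2. When a " is involved.
  3. When a sentence accidentally starts with a lowercase letter.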

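This is how I identify sentences in a text so far (source: Subtitles Reformat to end with complete sentence):

The re.findall part basically looks for chunks of str that start with a capital letter, [A-Z], followed by anything except punctuation, and end with a punctuation mark, [\.?!].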

import re

text = "We were able to respond to the first research question. Next, we also determined the size of the population."
# Print each match followed by an empty line.
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

Output:
We were able to respond to the first research question.

Next, we also determined the size of the population.

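Edge case 1: dot, dot, dot

The ellipsis is not kept, because the pattern says nothing about what to do when three dots appear in a row. How can this be changed?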

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

Output:
We were able to respond to the first research question.

Next, we also determined the size of the population.

Edge case 2: "

A " symbol inside the sentence is kept, but just like the extra dots after the punctuation, a " at the very end is dropped.

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

Output:
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

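Edge case 3: sentence starts with a lowercase letter

If a sentence accidentally starts with a lowercase letter, it is ignored. The goal is to recognize that the previous sentence has ended (or that the text has just begun), so a new sentence must be starting.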

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

Output:
We were able to respond to the first research question.

Edit

I tested this:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

...but I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the
dependency parser, or set sentence boundaries by setting
doc[i].is_sent_start.

Blu*_*ken 5

You can modify your regex to match your edge cases.

First, you don't need to escape . inside [].
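
As a quick check of that point, the sketch below compares the escaped and unescaped character classes. The sample string is made up purely for illustration. Inside [] the dot is already literal, so both patterns should find exactly the same characters.

```python
import re

# Hypothetical sample string, chosen only for this illustration.
sample = "Wait... what?! a.b"

escaped = re.findall(r'[\.!?]', sample)
unescaped = re.findall(r'[.!?]', sample)

# Inside a character class the dot is literal, so both lists are identical
# and neither contains a letter.
print(escaped == unescaped)
print(unescaped)
```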

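For the first edge case, you can greedily match the sentence-ending characters with [.!?]*

For the second, you can match an optional " after [.!?]

For the last one, you can let a sentence start with either an uppercase or a lowercase letter: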

import re

# [A-Za-z] instead of [A-z]: the range A-z also covers [ \ ] ^ _ and `
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

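Explanation

  • [A-z]: every match should start with a letter, either uppercase or lowercase. (The ASCII range A-z also includes [ \ ] ^ _ and `, so [A-Za-z] is the letters-only form.)
  • [^.?!]*: greedily matches any character that is not ., ? or ! (the sentence-ending characters).
  • [.?!]*: greedily matches the ending characters, so ...??!!??? would be matched as part of the sentence.
  • "?: finally matches an optional quotation mark at the end of the sentence.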
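
To see what each piece of the pattern contributes, here is a small sketch that applies the pieces separately. The input strings are invented for illustration. The first two lines also show the non-letter characters that the A-z range picks up in ASCII.

```python
import re

# [A-z] spans ASCII 65..122, which includes six non-letter characters.
extras = re.findall(r'[A-z]', '[\\]^_`')
print(extras)

# [A-Za-z] matches letters only, so the same input gives no matches.
letters_only = re.findall(r'[A-Za-z]', '[\\]^_`')
print(letters_only)

# [.?!]* greedily consumes a whole run of sentence-ending characters.
run = re.match(r'[^.?!]*[.?!]*', 'Really...??!! Yes').group()
print(run)

# "? optionally picks up a closing quote right after the punctuation.
quoted = re.match(r'[^.?!]*[.?!]*"?', '"What is this?" Next').group()
print(quoted)
```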

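Edge case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Edge case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Edge case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.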
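
For reuse, the pattern could be wrapped in a small helper. This is only a sketch under a few assumptions: split_sentences is a hypothetical name, the letters-only class [A-Za-z] is swapped in for [A-z], and each result is stripped of surrounding whitespace. The sample text combines the three edge cases in one string.

```python
import re

# Letter, then non-terminators, then any run of terminators,
# then an optional closing quote.
SENTENCE_RE = re.compile(r'[A-Za-z][^.!?]*[.!?]*"?')

def split_sentences(text):
    """Return the sentence-like chunks found in text (hypothetical helper)."""
    return [match.strip() for match in SENTENCE_RE.findall(text)]

text = "We were able to respond to the first research question... next, we asked: \"What is this?\" Then we stopped."
for sentence in split_sentences(text):
    print(sentence)
```

A plain regex like this still treats every period as a sentence end, so an abbreviation such as "e.g." would be cut into separate pieces.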