Python: split text into sentences

Art*_*yom 85 python text split

I have a text file. I need to get a list of sentences from it.

How can this be implemented? There are many subtleties, such as periods being used in abbreviations.

My old regex works badly:

re.compile('(\. |^|!|\?)([A-Z][^;?\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

Ned*_*der 132

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates that this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

  • You may have to run `nltk.download()` first and download the model -> `punkt`; a non-interactive sketch of that step follows these comments. (9 upvotes)
  • @Artyom: here is a direct link to the online documentation for [`nltk.tokenize.punkt.PunktSentenceTokenizer`](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html). (4 upvotes)
  • @Artyom: it will probably work for Russian as well -- see [Can NLTK/pyNLTK "work per language" (i.e. non-English), and how?](http://stackoverflow.com/questions/1795410). (3 upvotes)
  • To save some typing: `import nltk` and then `nltk.sent_tokenize(string)`. (2 upvotes)
  • This fails for sentences that end with a closing quotation mark, e.g. if a sentence ends with "this." (2 upvotes)
  • Well, you convinced me. But I just tested it and it doesn't seem to fail. My input was `'This fails for sentences ending in quotes. If we have a sentence that ends with "this". This is another sentence.'` and my output was `['This fails for sentences ending in quotes.', 'If we have a sentence that ends with "this".', 'This is another sentence.']` -- that looks correct to me. (2 upvotes)
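
A minimal sketch of the one-time model download mentioned in the first comment (non-interactive form, assuming a recent NLTK release):

import nltk
nltk.download('punkt')  # fetches the Punkt sentence tokenizer models used above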

小智 81

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re

alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Rewrite periods that do not end a sentence as the placeholder <prd>
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    # Move terminal punctuation outside closing quotes so it marks the break
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    # Mark real sentence boundaries, then restore the protected periods
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

  • This is a great solution. However, I added two more lines: `digits = "([0-9])"` among the regex declarations and `text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)` inside the function. Now it no longer splits a line at decimals such as 5.5. Thank you for this answer. (12 upvotes)
  • A very nice solution. In the function I added `if "e.g." in text: text = text.replace("e.g.", "e<prd>g<prd>")` and `if "i.e." in text: text = text.replace("i.e.", "i<prd>e<prd>")`, and it completely solved my problem. (4 upvotes)
  • How did you parse the whole of Huckleberry Finn? Where is it available in text format? (2 upvotes)
  • Great solution, very helpful. Just to make it a bit more robust: `prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"`, `websites = "[.](com|net|org|io|gov|me|edu)"`, and `if "..." in text: text = text.replace("...", "<prd><prd><prd>")` (a combined sketch of these tweaks follows below). (2 upvotes)
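
A hedged sketch consolidating the tweaks suggested in these comments (decimals, "e.g."/"i.e.", ellipses); `split_into_sentences_extended` is a hypothetical wrapper around the function above, and it relies on `re` being imported and on `split_into_sentences` converting `<prd>` back to periods at the end:

digits = "([0-9])"

def split_into_sentences_extended(text):
    # keep decimals such as 5.5 together (suggested in the first comment)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    # protect common Latin abbreviations and ellipses before splitting
    text = text.replace("e.g.", "e<prd>g<prd>")
    text = text.replace("i.e.", "i<prd>e<prd>")
    text = text.replace("...", "<prd><prd><prd>")
    return split_into_sentences(text)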

Has*_*aza 30

Instead of using regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Reference: https://stackoverflow.com/a/9474645/2877052
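
Since the original question hints at non-English text, note that `sent_tokenize` also accepts a `language` argument selecting the matching Punkt model; a small sketch (it assumes the `punkt` data, which bundles several European languages, has been downloaded):

from nltk.tokenize import sent_tokenize

german = "Guten Morgen Dr. Adams. Der Patient wartet in Zimmer 3."
print(sent_tokenize(german, language="german"))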

  • I found that nltk.tokenize.sent_tokenize gives wrong splits when it encounters i.e., e.g. and other abbreviations; one workaround is sketched below. (2 upvotes)
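
A hedged sketch of one workaround for the abbreviation issue mentioned above: build a `PunktSentenceTokenizer` with extra known abbreviations (the abbreviation list here is purely illustrative):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['e.g', 'i.e', 'dr', 'vs'])  # lowercase, trailing period omitted
tokenizer = PunktSentenceTokenizer(punkt_param)

print(tokenizer.tokenize("We like citrus fruit, e.g. oranges. Dr. Adams agrees."))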

小智 9

Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    # Repeatedly peel the last sentence off the paragraph until no terminator remains
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    possible_endings, contraction_locations = [], []
    contractions = abbreviations.keys()
    # A terminator may be followed by a closing quote or bracket, e.g. '."' or '?)'
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    # Discard terminators that are really the trailing period of an abbreviation
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    # Keep only endings at the end of the paragraph or followed by a space
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python


mph*_*mph 9

I'm a big fan of spaCy, but I recently discovered two new approaches for sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD from AI2 (supremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I split 20k sentences using five different methods. Here are the elapsed times on an AMD Threadripper Linux machine:

  • spaCy Sentencizer:1.16934s
  • spaCy 解析:25.97063s
  • PySBD:9.03505s
  • NLTK:0.30512s
  • BlingFire:0.07933s

Update: I tried using BlingFire on all-lowercase text, and it failed miserably. For now I'm going to use PySBD in my projects.
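
For reference, the "spaCy Sentencizer" timing above refers to spaCy's rule-based splitter, which skips the dependency parser entirely; a minimal sketch using the spaCy 3 API (no trained model required):

import spacy

nlp = spacy.blank("en")        # blank pipeline, no model download needed
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection

doc = nlp("How are you today? I hope you have a great day.")
print([sent.text for sent in doc.sents])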


Raf*_*ler 6

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The core of the regex is ` *\. +`, which matches a period surrounded by 0 or more spaces on the left and 1 or more on the right (so that, for example, the period in re.split is not counted as a sentence break); the character classes in the code above additionally allow `?`/`!` as terminators and optional closing quotes or brackets after them.

Obviously, this isn't the most robust solution, but it'll do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in `sentences` starts with a capital letter? A sketch of that check follows below).
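
A hedged sketch of that capitalization check (`merge_lowercase_fragments` is a hypothetical helper; since `re.split` above drops the terminators, the period re-inserted when merging is only an approximation):

def merge_lowercase_fragments(sentences):
    # Merge a fragment into the previous sentence when it does not start
    # with an uppercase letter (e.g. text left over after a lowercase abbreviation).
    merged = []
    for s in sentences:
        if merged and s and not s[0].isupper():
            merged[-1] += ". " + s
        else:
            merged.append(s)
    return merged

sentences = [s for s in merge_lowercase_fragments(sentences) if s]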

  • Can you think of a situation in English where a sentence doesn't end with a period? Imagine that! My answer to that would be, "think again." (See what I did there?) (28 upvotes)

Elf*_*Elf 6

You can try using spaCy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())

  • spaCy is great, but if you only need to split into sentences, passing the text through spaCy will take too long if you're dealing with a data pipeline. (2 upvotes)
  • Also, for AWS Lambda Serverless users, spaCy's supporting data files run to hundreds of MB (the large English model is > 400MB), so you can't use something like this out of the box, which is a great pity (huge fan of spaCy here). (2 upvotes)

woo*_*ody 6

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)


小智 6

Using spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())