Art*_*yom 85 python text split
I have a text file. I need a list of sentences.
How can this be implemented? There are many subtleties, such as periods being used in abbreviations.
My old regular expression is terrible:
re.compile('(\. |^|!|\?)([A-Z][^;?\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
Ned*_*der 132
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates that this does it:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
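One practical note (not part of the original answer): the punkt model has to be downloaded once before nltk.data.load can find it, otherwise NLTK raises a LookupError:

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models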
小智 81
This function can split the whole text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."
# -*- coding: utf-8 -*-
import re

alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
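A quick way to try the function on the edge-case text quoted above:

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel "
        "before joining Nike Inc. as an engineer. He also worked at craigslist.org "
        "as a business analyst.")
for sentence in split_into_sentences(text):
    print(sentence)  # prints the two sentences on separate lines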
Has*_*aza 30
Instead of using a regular expression to split the text into sentences, you can also use the nltk library.
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
Reference: https://stackoverflow.com/a/9474645/2877052
小智 9
Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variations of terminators, for example: '.' vs. '."'
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress',
                 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example',
                 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
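For example (the sample paragraph here is my own, added just to show the call):

paragraph = "This approach needs no external libraries, i.e. nothing to install. It handles wrappers too!"
print(find_sentences(paragraph))
# ['This approach needs no external libraries, i.e. nothing to install.', 'It handles wrappers too!']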
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
I'm very fond of spaCy, but I recently discovered two new approaches to sentence tokenization. One is BlingFire from Microsoft (blazingly fast), and the other is PySBD from AI2 (extremely accurate).
text = ...

# BlingFire
from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

# PySBD
from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)
I split 20k sentences with five different methods and compared the runtimes on an AMD Threadripper Linux machine.
Update: I tried BlingFire on all-lowercase text and it failed miserably. For the time being I'll be using PySBD in my projects.
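A minimal sketch of how such a timing comparison can be run (the corpus.txt filename and the timed helper are my own; the library calls are the same as in the snippet above):

import time
from blingfire import text_to_sentences
from pysbd import Segmenter

def timed(label, fn):
    # run one segmenter and report wall-clock time and sentence count
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {len(result)} sentences in {time.perf_counter() - start:.2f}s")
    return result

text = open("corpus.txt", encoding="utf-8").read()  # hypothetical input file

timed("BlingFire", lambda: text_to_sentences(text).split('\n'))
segmenter = Segmenter(language='en', clean=False)
timed("PySBD", lambda: segmenter.segment(text))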
For simple cases (where sentences are terminated normally), this should work:
import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The core of the regex is ` *\. +`, which matches a period surrounded by zero or more spaces on the left and one or more on the right (so that the period in something like `re.split` itself is not counted as a sentence break).
Obviously this isn't the most robust solution, but it's fine in most cases. The only thing it won't handle is abbreviations (perhaps go through the resulting list of sentences and check whether each string starts with a capital letter?).
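A rough sketch of that post-processing idea, merging any fragment that does not start with an uppercase letter back into the previous sentence (the merged list is my own naming):

merged = []
for s in sentences:
    # treat a fragment starting with a lowercase letter as a false split
    # (e.g. after an abbreviation) and glue it onto the previous sentence
    if merged and s and not s[0].isupper():
        merged[-1] = merged[-1] + ' ' + s
    else:
        merged.append(s)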
You can try using spaCy instead of regex. I use it and it does the job.
import spacy

nlp = spacy.load('en')  # with spaCy v3+, load an installed model such as 'en_core_web_sm'
text = '''Your text here'''
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.string.strip())  # with spaCy v3+, use sent.text instead of sent.string
You can also use the sentence tokenization function from NLTK:
from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes. Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."
sent_tokenize(sentence)
小智 6
Using spaCy:
import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.string.strip())