I'm using NLTK to analyze some classic texts, and I'm running into trouble tokenizing the text sentence-by-sentence. For example, here's a snippet from Moby Dick:
import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
-----
Hussey?
-----
"
'''
Considering that Melville's syntax is a bit dated, I don't expect perfection here, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.
Does anyone have a recommendation for a better sentence tokenizer? I'd prefer a simple heuristic I can hack rather than having to train my own parser.
vpe*_*kar 45
You need to supply the tokenizer with a list of abbreviations, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
The `sentences` variable now contains:
['is THAT what you mean, Mrs. Hussey?']
Update: this does not work if the last word of a sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put a space between the sentence-ending symbol (.!?) and the apostrophe or quotation mark that follows it:
text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
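Combining the abbreviation list with the quote-spacing workaround, a rough sketch might look like this (the `split_sentences` helper and the sample sentence are my own, not part of the original answer):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def split_sentences(text):
    """Sentence-split text using an untrained Punkt tokenizer seeded with
    extra abbreviations, plus the quote-spacing workaround above."""
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    splitter = PunktSentenceTokenizer(punkt_param)
    # Pad a closing quote that directly follows end punctuation so the
    # quote is not glued onto the start of the next sentence.
    for mark in ('?"', '!"', '."'):
        text = text.replace(mark, mark[0] + ' "')
    return splitter.tokenize(text)

print(split_sentences('Is that you, Mrs. Hussey? I thought so.'))
```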
bjm*_*jmc 32
You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set `_params.abbrev_types`. For example:
extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
Note that the abbreviations must be specified without the final period, but must include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, consult the relevant documentation.
By setting the `realign_boundaries` parameter to `True`, you can tell the `PunktSentenceTokenizer.tokenize` method to include the "terminal" double quotation marks with the rest of the sentence. See the code below for an example.
I don't know of a clean way to prevent text like Mrs. Hussey from being split into two sentences, but here is a hack: mangle every Mrs. Hussey into Mrs._Hussey, call sent_tokenize.tokenize, then unmangle Mrs._Hussey back into Mrs. Hussey. I wish I knew a better way, but this might work in a pinch.
import nltk
import re
import functools
mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'''
sample = mangle(sample)
sentences = [unmangle(sent) for sent in sent_tokenize.tokenize(
    sample, realign_boundaries=True)]
print("\n-----\n".join(sentences))
Output:
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"