How to tweak the NLTK sentence tokenizer

Chr*_*son 34 python nlp nltk

I'm using NLTK to analyze some classic texts, and I'm running into trouble tokenizing the text sentence by sentence. For example, here's a snippet from Moby Dick:

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = '"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''

Given that Melville's syntax is a bit dated, I don't expect perfection here, but NLTK really ought to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Can anyone recommend a better sentence tokenizer? I'd prefer a simple heuristic that I can hack, rather than having to train my own parser.

vpe*_*kar 45

You need to supply the tokenizer with a list of abbreviations, like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?']

Update: This does not work if the last word of a sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty workaround is to put spaces in front of apostrophes and quotes that follow sentence-ending punctuation (.!?):

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
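The chained replace calls above can be collapsed into a single regular expression. This is a minimal sketch using only the standard library, assuming the same punctuation set (.!?) and straight double quotes; the helper name space_out_quotes is mine, not NLTK's:

```python
import re

def space_out_quotes(text):
    # Insert a space between sentence-final punctuation and a following
    # double quote, so the tokenizer can find the sentence boundary.
    return re.sub(r'([.!?])"', r'\1 "', text)

text = 'is THAT what you mean, Mrs. Hussey?" says I'
print(space_out_quotes(text))  # is THAT what you mean, Mrs. Hussey? " says I
```

The same pattern can be extended with an apostrophe in the character class if single-quoted dialogue needs the same treatment.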

  • I usually avoid "thanks" comments, but this one is truly deserved: thank you! (3 upvotes)
  • The problem with this answer is that it doesn't "tweak" the existing English tokenizer. If you create one from scratch like this, you lose many of its other features. See http://stackoverflow.com/a/25375857/4582054 (3 upvotes)

bjm*_*jmc 32

You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

import nltk

extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

Note that the abbreviations must be specified without the final period, but they must include any internal periods, as with 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
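Since the stored abbreviation types are lowercase and have no final period, it can be convenient to normalize a human-readable list before updating the set. This is a hypothetical helper (normalize_abbrevs is not part of NLTK), sketched with the standard library only:

```python
def normalize_abbrevs(abbrevs):
    """Convert abbreviations like 'Mrs.' or 'i.e.' to the form Punkt
    stores internally: lowercased, with the final period(s) stripped
    but internal periods kept."""
    return {a.lower().rstrip('.') for a in abbrevs}

print(sorted(normalize_abbrevs(['Mrs.', 'Dr.', 'i.e.', 'Prof'])))
# ['dr', 'i.e', 'mrs', 'prof']
```

The resulting set can then be passed to sentence_tokenizer._params.abbrev_types.update(...) as shown above.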

  • This should be the accepted answer. If you just create a new tokenizer, you lose all the existing features of the English tokenizer. (3 upvotes)
  • It doesn't seem to work for me, while the top answer does. (2 upvotes)

unu*_*tbu 8

You can tell the PunktSentenceTokenizer.tokenize method to include the "terminal" double quote with the rest of the sentence by setting the realign_boundaries parameter to True. See the code below for an example.

I don't know of a clean way to prevent text like Mrs. Hussey from being split into two sentences. However, here is a hack which

  • mangles all occurrences of Mrs. Hussey into Mrs._Hussey,
  • then splits the text into sentences with sent_tokenize.tokenize,
  • then, for each sentence, unmangles Mrs._Hussey back into Mrs. Hussey

I wish I knew a better way, but this may help in a pinch.


import nltk
import re
import functools

mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')

sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'''    

sample = mangle(sample)
sentences = [unmangle(sent) for sent in sent_tokenize.tokenize(
    sample, realign_boundaries = True)]    

print("\n-----\n".join(sentences))

which yields

"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
Run Code Online (Sandbox Code Playgroud)