Keeping blank lines with NLTK's Punkt tokenizer

duh*_*ime 7 python nlp newline line-breaks nltk

I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the blank lines from the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I want this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But what actually gets printed shows that the trailing blank lines have been stripped from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing blank lines when using the Punkt sentence tokenizer? I'd appreciate any insight others can offer!

Hug*_*hot 11

The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting from the source of nltk.tokenize.punkt and following the calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

designed to make the tokenizer skip whitespace until the next possible sentence-start token. Looking for the regex this relies on, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so blank lines are never captured.

The solution

Take it apart, change the part you don't like, and reassemble your custom solution.

Now this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the old one will include 0 or more \s characters after the end of a sentence.
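To see the effect of that one-character-class change in isolation, here is a minimal sketch using plain re, with simplified stand-ins for the template placeholders (the sent_end and non_word patterns below are illustrative assumptions, not Punkt's actual SentEndChars/NonWord values):

```python
import re

# Simplified stand-ins for the %(SentEndChars)s and %(NonWord)s placeholders.
sent_end = r"[.!?]"
non_word = r"[)\";}\]\*:@'\({\[]"

# Default shape: \s+ lives inside the lookahead, so the match (and hence the
# sentence slice) ends right at the sentence-final punctuation.
default_re = re.compile(r"\S*" + sent_end + r"(?=(" + non_word + r"|\s+(\S+)))")

# Patched shape: \s* is consumed before the lookahead, so trailing whitespace
# becomes part of the match.
patched_re = re.compile(r"\S*" + sent_end + r"\s*(?=(" + non_word + r"|(\S+)))")

s = "That was a very loud beep.\n\n Mark?"
m1 = default_re.search(s)
m2 = patched_re.search(s)
print(repr(s[:m1.end()]))  # 'That was a very loud beep.'      -- whitespace skipped
print(repr(s[:m2.end()]))  # 'That was a very loud beep.\n\n ' -- whitespace kept
```

The only difference between the two patterns is where the whitespace is matched: inside the lookahead (not consumed) versus before it (consumed), which is exactly the change made to _period_context_fmt above.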

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
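As an aside, if you'd rather not subclass, a similar result can be had without touching the regex: take the spans from the stock tokenizer's span_tokenize() and stretch each one to the start of the next, so the whitespace Punkt skips gets re-attached to the preceding sentence. This is a sketch of that idea (the helper name tokenize_keep_whitespace is mine, not part of NLTK):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def tokenize_keep_whitespace(tokenizer, text):
    """Stretch each sentence span up to the start of the next span, so the
    inter-sentence whitespace stays attached to the preceding sentence."""
    spans = list(tokenizer.span_tokenize(text))
    # Each sentence now ends where the next one begins (or at end of text).
    next_starts = [start for start, _ in spans[1:]] + [len(text)]
    return [text[start:stop] for (start, _), stop in zip(spans, next_starts)]

# An untrained tokenizer with default parameters; a trained english.pickle
# tokenizer works here just as well.
tokenizer = PunktSentenceTokenizer()
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
print(tokenize_keep_whitespace(tokenizer, s))
```

Since the stretched spans tile the whole string, joining the result always reproduces the input exactly, which is a handy invariant to check.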