duh*_*ime 7 python nlp newline line-breaks nltk
I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I'd like to preserve the blank lines in the file:
from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)
I'd like this to print:
['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']
But what actually prints shows that the trailing blank lines have been removed from the first and third sentences:
['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']
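Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option for the Punkt tokenizer. It's quite possible I'm missing something simple. Is there a way to retain these trailing blank lines with the Punkt sentence tokenizer? I'd appreciate any insight others can offer!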
Hug*_*hot 11
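Unfortunately, you can't make the tokenizer keep the blank lines, not with the way it is written.
Starting from the Punkt tokenizer's source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition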
if match.group('next_tok'):
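that is designed to make the tokenizer skip whitespace until the next possible sentence-starting token occurs. Looking for the regex behind it, we end up at _period_context_fmt, where the next_tok named group is preceded by \s+. Because that \s+ sits inside a lookahead, the whitespace is checked but never included in the sentence, so the blank lines are not captured.
Break it down, change the part you don't like, and reassemble your custom solution.
This regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We only need to derive a custom class from PunktLanguageVars and fix the regex the way we want it.
The fix we want is to include the trailing newlines at the end of a sentence, so I suggest changing _period_context_fmt from this: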
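To see why the whitespace is dropped, here is a minimal sketch using only the standard `re` module. The pattern is a simplified stand-in for the Punkt one (an assumption for illustration: `[.?!]` replaces `%(SentEndChars)s` and the `%(NonWord)s` branch is left out), so it shows the lookahead behaviour rather than the library's real regex.

```python
import re

# Simplified stand-in for Punkt's period-context pattern (an assumption for
# illustration; the real pattern is built from PunktLanguageVars attributes).
period_context = re.compile(r"\S*[.?!](?=(?P<after_tok>\s+(?P<next_tok>\S+)))")

text = "Mark?\n\n Mark are you there?"
match = period_context.search(text)

# The lookahead inspects the whitespace and the next token without consuming
# them, so the match itself stops right after the question mark.
print(repr(match.group(0)))           # 'Mark?'
print(repr(match.group("next_tok")))  # 'Mark'
print(match.end())                    # 5
```

The blank lines are seen by the pattern, but they fall outside the matched text, which is why they never end up attached to the sentence.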
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
\s+(?P<next_tok>\S+) # or whitespace and some other token
))"""
to this:
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
\s* # <-- THIS is what I changed
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
(?P<next_tok>\S+) # <-- Normally you would have \s+ here
))"""
Now a tokenizer that uses this regex instead of the old one will include zero or more \s characters after the end of each sentence.
import nltk.tokenize.punkt as pkt
class CustomLanguageVars(pkt.PunktLanguageVars):
    # Override the sentence-boundary pattern so trailing whitespace is kept.
    _period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
\s* # <-- THIS is what I changed
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
(?P<next_tok>\S+) # <-- Normally you would have \s+ here
))"""
custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
print(custom_tknzr.tokenize(s))
This outputs:
['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
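Note the single space at the end of 'That was a very loud beep.\n\n ': since \s* matches any whitespace, it also swallows the leading space of the following line. The sketch below contrasts the two pattern shapes using plain `re`. Both patterns are simplified stand-ins (an assumption for illustration: `[.?!]` replaces `%(SentEndChars)s` and the `%(NonWord)s` branch is omitted), so this shows the mechanism, not the exact regexes NLTK compiles.

```python
import re

# Simplified stand-ins for the two patterns (assumptions for illustration,
# not the regexes NLTK actually compiles).
before = re.compile(r"\S*[.?!](?=(?P<after_tok>\s+(?P<next_tok>\S+)))")
after = re.compile(r"\S*[.?!]\s*(?=(?P<after_tok>(?P<next_tok>\S+)))")

s = ("That was a very loud beep.\n\n I don't even know\n"
     " if this is working. Mark?\n\n Mark are you there?\n\n\n")

# Neither pattern matches the final "there?" because no token follows it.
before_matches = [m.group(0) for m in before.finditer(s)]
after_matches = [m.group(0) for m in after.finditer(s)]

print(before_matches)  # ['beep.', 'working.', 'Mark?']
print(after_matches)   # ['beep.\n\n ', 'working. ', 'Mark?\n\n ']
```

If only the newlines should stay with the sentence, the trailing spaces would have to be handled separately, for example by post-processing the returned sentences.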