NLTK word tokenization behavior for double quotes is confusing

Mot*_*sim 12 python nltk

import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]

See how it turns the double quote " into `` and ''.

What is happening here? Why is the character being changed? Is there a fix? I need to search the string for each token later.

Python 2.7.6, if it makes any difference.

alv*_*vas 13

TL;DR:

nltk.word_tokenize changes starting double quotes " -> `` and ending double quotes " -> ''.


In long:

First, nltk.word_tokenize tokenizes based on how the Penn Treebank is tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23

class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.

Next comes the list of contraction regex replacements at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48, which comes from "Robert MacIntyre's tokenizer", i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed

The contractions split words like "gonna", "wanna", etc.:

>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
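The contraction splitting above can be reproduced with plain re.sub calls. A minimal sketch, using patterns in the spirit of tokenizer.sed (the pattern list here is illustrative, not NLTK's own variable):

```python
import re

# Illustrative contraction patterns: each captures the two halves of a
# contraction so they can be re-emitted with a space between them.
contractions = [re.compile(r"(?i)\b(gon)(na)\b"),
                re.compile(r"(?i)\b(wan)(na)\b")]

text = "I wanna go home"
for pattern in contractions:
    # \1 \2 splits the matched contraction into two tokens.
    text = pattern.sub(r"\1 \2", text)

print(text.split())  # ['I', 'wan', 'na', 'go', 'home']
```

The real tokenizer has many more patterns (cannot, gotta, lemme, ...), but the mechanism is the same.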

After that we reach the punctuation part you are asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:

def tokenize(self, text):
    #starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)

Aha, starting quotes are changed from " -> ``:

>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'

Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 handling the ending quotes:

    #ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)

Applying the regexes:

>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
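The individual re.sub steps above can be chained into one helper for clarity. This is a condensed sketch of just the quote-handling rules shown here (the helper name is mine; the real tokenizer applies many more rules between these):

```python
import re

def treebank_quotes(text):
    # Condensed sketch of TreebankWordTokenizer's quote handling only.
    text = re.sub(r'^\"', r'``', text)                # opening quote at string start
    text = re.sub(r'(``)', r' \1 ', text)             # pad backtick-style quotes
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)    # opening quote after space/bracket
    text = re.sub(r'"', " '' ", text)                 # remaining quotes are closers
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)     # pad closing quotes
    return text.split()

print(treebank_quotes('"A"'))  # ['``', 'A', "''"]
```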

So, if you want to search for double quotes in the token list after nltk.word_tokenize, simply search for `` and '' instead of ".
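Alternatively, if downstream code really needs plain " tokens, you can map the Treebank quote tokens back. A minimal sketch (the helper name is mine, not NLTK's):

```python
def normalize_quotes(tokens):
    """Map the Treebank quote tokens `` and '' back to a plain "."""
    return ['"' if tok in ('``', "''") else tok for tok in tokens]

# Tokens as produced by nltk.word_tokenize('"A"') in the question:
print(normalize_quotes(['``', 'A', "''"]))  # ['"', 'A', '"']
```

Note this loses the open/close distinction that the Treebank convention encodes, so only do it if you don't care which side of the quote a token came from.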