>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
See how the double quote turns into double `` and ''?
What is happening here? Why is the character being changed? Is there a fix? I need to search the string for each token later.
This is Python 2.7.6, if it makes any difference.
Answered by alvas:
TL;DR:

nltk.word_tokenize changes starting double quotes " -> `` and ending double quotes " -> ''.

In long:
First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized, using nltk.tokenize.treebank; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23:
class TreebankWordTokenizer(TokenizerI):
"""
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
This is the method that is invoked by ``word_tokenize()``. It assumes that the
text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
Next comes the list of contraction regex replacements at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48, which comes from "Robert MacIntyre's tokenizer", i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed
The contraction patterns split words like "gonna", "wanna", etc.:
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
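To see how those sed-style substitutions work without installing NLTK, here is a minimal sketch of two of the contraction patterns applied with plain re (the pattern list and function name are illustrative, not NLTK's actual API):

```python
import re

# Two of the Treebank-style contraction patterns ("gonna", "wanna"),
# each splitting one word into two groups separated by a space.
CONTRACTIONS = [
    (re.compile(r"(?i)\b(gon)(na)\b"), r"\1 \2"),
    (re.compile(r"(?i)\b(wan)(na)\b"), r"\1 \2"),
]

def split_contractions(text):
    # Apply each substitution in turn, as the Treebank tokenizer does.
    for pattern, repl in CONTRACTIONS:
        text = pattern.sub(repl, text)
    return text

print(split_contractions("I wanna go home").split())
# ['I', 'wan', 'na', 'go', 'home']
```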
After that we reach the punctuation part you are asking about; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:
def tokenize(self, text):
#starting quotes
text = re.sub(r'^\"', r'``', text)
text = re.sub(r'(``)', r' \1 ', text)
text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
Aha, the starting quote is changed: " -> ``:
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
Then we see that https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 handles the ending quotes:
#ending quotes
text = re.sub(r'"', " '' ", text)
text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
Applying those regexes:
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
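The whole quote-handling pipeline traced above can be condensed into one function. This is a sketch using plain re that mirrors the substitutions from treebank.py quoted above (the function name is mine, not NLTK's):

```python
import re

def treebank_quotes(text):
    # Starting quotes, as in treebank.py#L63.
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
    # Ending quotes, as in treebank.py#L85.
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r"(\S)('')", r'\1 \2 ', text)
    return text.split()

print(treebank_quotes('"A"'))  # ['``', 'A', "''"]
```

This reproduces the quote behavior of the question's examples, though the real tokenizer applies many more substitutions around these.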
So if you want to search for double quotes in the token list after nltk.word_tokenize, simply search for `` and '' instead of ".
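Alternatively, if you would rather keep searching for the literal " character, you can map the Treebank quote tokens back after tokenizing. A minimal sketch (helper name is mine):

```python
def normalize_quotes(tokens):
    # Map the Treebank opening/closing quote tokens back to a plain '"'.
    return ['"' if tok in ('``', "''") else tok for tok in tokens]

tokens = ['``', 'A', "''"]       # what nltk.word_tokenize('"A"') returns
print(normalize_quotes(tokens))  # ['"', 'A', '"']
```

Note that this loses the opening/closing distinction, which is exactly the information the Treebank convention encodes.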