Why is Tokenizer keeping track of more words than num_words?


I have the following code:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog',
]

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)

Output: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Now how is my code keeping track of more than 3 unique words?

Am I missing anything here?


Looking at the source code, it seems that an index is assigned to every word encountered. However, once you actually use the tokenizer to convert texts into sequences of indices (e.g. with texts_to_sequences), all the "infrequent" words are replaced by the OOV token. Note that this is only done if you actually specify an OOV token (which you did not). Example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog dog',
]

tokenizer = Tokenizer(num_words=4, oov_token=None)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)
print(tokenizer.texts_to_sequences(["I love my cat"]))

Output:

{'love': 1, 'you': 6, 'i': 4, 'dog': 3, 'my': 2, 'cat': 5}
[[1, 2]]

I modified the texts slightly to break the tie between "dog" and "i", and increased the number of kept words by one (for whatever reason, specifying 4 actually only uses the three most frequent words...). You can see that the OOV words ("i" and "cat") are simply dropped from the text, even though they have been assigned indices.
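
The off-by-one comes from the conversion step: Keras keeps a word only if its index is strictly less than num_words, and since indices start at 1, num_words=4 leaves room for indices 1 through 3. Here is a minimal sketch of that cutoff rule (the filter_sequence helper is hypothetical, just for illustration; it is not part of the Keras API):

def filter_sequence(words, word_index, num_words):
    # Keep a word only if it has an index and that index is < num_words.
    return [word_index[w] for w in words
            if w in word_index and word_index[w] < num_words]

word_index = {'love': 1, 'my': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6}
print(filter_sequence(['i', 'love', 'my', 'cat'], word_index, 4))  # [1, 2]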

Here is what happens if we do specify an OOV token:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog dog',
]

tokenizer = Tokenizer(num_words=4, oov_token="oov")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)
print(tokenizer.texts_to_sequences(["I love my cat"]))

Output:

{'love': 2, 'you': 7, 'i': 5, 'dog': 4, 'my': 3, 'cat': 6, 'oov': 1}
[[1, 2, 3, 1]]

As you can see, index 1 is now reserved for the OOV token, and infrequent words are assigned to it during the conversion.
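
You can also map the sequence back with sequences_to_texts (part of the same Tokenizer API) to see which positions were replaced; the snippet below re-fits the tokenizer so it runs on its own:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=4, oov_token="oov")
tokenizer.fit_on_texts(['I love my dog', 'I love my cat', 'You love my dog dog'])

# Positions that were cut off by num_words come back as the OOV token.
print(tokenizer.sequences_to_texts([[1, 2, 3, 1]]))  # ['oov love my oov']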