Using num_words in the Keras Tokenizer class

Aka*_*lia 3 python nlp machine-learning keras tensorflow

I'd like to understand the difference between the following two snippets.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 1)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Output - {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

vs.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Output - {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

If the tokenizer indexes every unique word on the fly anyway, what is the point of num_words?

Mar*_*ani 5

word_index simply maps every word in the whole text corpus to an id, regardless of what num_words is.
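
A quick way to confirm this is to fit two tokenizers that differ only in num_words and compare their word_index (a minimal sketch; the variable names tok_small and tok_large are just for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# two tokenizers that differ only in num_words
tok_small = Tokenizer(num_words = 1)
tok_large = Tokenizer(num_words = 100)
tok_small.fit_on_texts(sentences)
tok_large.fit_on_texts(sentences)

# word_index is built from the full corpus, so both mappings are identical
assert tok_small.word_index == tok_large.word_index
print(tok_small.word_index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}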

The difference shows up when the tokenizer is actually used, for example when calling texts_to_sequences:

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 1+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences) # [[1], [1], [1]]

This only returns the id of love, the most frequent word: with num_words = 1+1 only ids strictly below 2 are kept (word ids start at 1, index 0 is reserved), so just the single most frequent word survives.

whereas

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100+1)
tokenizer.fit_on_texts(sentences)
tokenizer.texts_to_sequences(sentences) # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]

returns the ids of the 100 most frequent words (here that covers the entire six-word vocabulary).
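
As a side note, if silently dropping the rarer words is not what you want, Tokenizer also accepts an oov_token. The sketch below assumes a recent tf.keras, where the OOV token is assigned index 1, so any word whose id is >= num_words is replaced by that id instead of being skipped; the '<OOV>' string and the outputs in the comments are illustrative:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# keep the 3 most frequent words (ids 1-3), map everything else to '<OOV>'
tokenizer = Tokenizer(num_words = 3+1, oov_token = '<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
# e.g. {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}
print(tokenizer.texts_to_sequences(sentences))
# e.g. [[1, 2, 3, 1], [1, 2, 3, 1], [1, 2, 3, 1]] -- ids >= num_words become the OOV id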