
Using the Keras Tokenizer for new words not in the training set

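I'm currently using the Keras Tokenizer to create a word index, and then matching that word index against the imported GloVe dictionary to create an embedding matrix. The problem is that this seems to defeat one of the advantages of using word vector embeddings: when the trained model is used for prediction and it encounters a new word that is not in the tokenizer's word index, that word is removed from the sequence.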
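To make the behaviour described above concrete, here is a minimal pure-Python sketch of an index-based tokenizer. It is a toy stand-in, not the Keras implementation: the helpers `fit_word_index` and `to_sequence`, the `<unk>` token, and the example sentences are all hypothetical. It shows an unseen word being silently dropped, and how reserving an out-of-vocabulary id keeps the word's position instead.

```python
# Toy stand-in for an index-based tokenizer; NOT the Keras implementation.
def fit_word_index(texts, oov_token=None):
    """Assign integer ids (starting at 1) to words by descending frequency."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    if oov_token is not None:
        ordered.insert(0, oov_token)  # reserve the first id for unseen words
    return {word: i for i, word in enumerate(ordered, start=1)}


def to_sequence(text, word_index, oov_token=None):
    """Map words to ids; unseen words are dropped unless oov_token is set."""
    seq = []
    for word in text.lower().split():
        if word in word_index:
            seq.append(word_index[word])
        elif oov_token is not None:
            seq.append(word_index[oov_token])
    return seq


train_texts = ["the cat sat", "the dog sat"]
new_text = "the bird sat"  # "bird" never appeared during fitting

plain_index = fit_word_index(train_texts)
dropped = to_sequence(new_text, plain_index)  # "bird" disappears

oov_index = fit_word_index(train_texts, oov_token="<unk>")
kept = to_sequence(new_text, oov_index, oov_token="<unk>")  # "bird" -> OOV id
```

The Keras `Tokenizer` constructor accepts an `oov_token` argument that plays a similar role, although every unseen word then shares a single id and therefore a single embedding row.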

import numpy as np
from keras.preprocessing.text import Tokenizer

# fit the tokenizer (texts is the list of training strings, defined elsewhere)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

# load the GloVe embeddings into a dict mapping word -> vector
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

#create embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros. …
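The listing above is cut off in the source, so the rest of the loop is not shown. As a separate illustration, and not the author's code, the sketch below shows one common way such a loop can populate the matrix. It is an assumption throughout: plain Python lists stand in for NumPy arrays, and the tiny `word_index` and `embeddings_index` dicts are hypothetical toy data rather than real tokenizer output or GloVe vectors.

```python
# Toy data: hypothetical stand-ins for tokenizer.word_index and the GloVe dict.
dims = 3
word_index = {"sat": 1, "the": 2, "zzyzx": 3}
embeddings_index = {
    "the": [0.1, 0.2, 0.3],
    "sat": [0.4, 0.5, 0.6],
}

# Row 0 is reserved because word ids start at 1.
embedding_matrix = [[0.0] * dims for _ in range(len(word_index) + 1)]
missing = []
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        # Copy the pretrained vector into the row for this word id.
        embedding_matrix[i] = list(vector[:dims])
    else:
        # No pretrained vector: row i stays all zeros.
        missing.append(word)
```

Note that this only covers words the tokenizer saw during fitting. A word that first appears at prediction time has no row at all, which is the limitation the question is about.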


python nlp machine-learning deep-learning keras

Nic*_*ade

2018-01-25

14
votes

2
answers

2920
views