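I am currently using the Keras Tokenizer to build a word index, and then matching that word index against the imported GloVe dictionary to create an embedding matrix. The problem is that this seems to defeat one of the advantages of using word vector embeddings: when the trained model is used for prediction and encounters a new word that is not in the tokenizer's word index, that word is removed from the sequence.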
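To make the behaviour described above concrete, here is a minimal pure-Python sketch. It is a simplified stand-in for the tokenizer, not the Keras implementation: `fit_word_index` and `to_sequence` are hypothetical helpers, and the index chosen for unknown words is arbitrary. It shows an unseen word disappearing from the sequence, and what changes if one index is reserved for out-of-vocabulary words (the real `Tokenizer` exposes an `oov_token` argument for this purpose).

```python
def fit_word_index(texts):
    """Assign 1-based indices to words, most frequent first (hypothetical helper)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so ties keep their first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}


def to_sequence(text, word_index, oov_index=None):
    """Map words to indices; unknown words are dropped unless oov_index is given."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_index)
        if idx is not None:
            seq.append(idx)
    return seq


word_index = fit_word_index(["the cat sat on the mat"])

# "dog" was never seen during fitting, so it silently vanishes.
dropped = to_sequence("the dog sat", word_index)

# Reserving one extra index keeps the position of the unknown word.
oov_index = len(word_index) + 1
kept = to_sequence("the dog sat", word_index, oov_index=oov_index)

print(dropped)  # [1, 3]
print(kept)     # [1, 6, 3]
```

Reserving an index keeps the sequence length intact, but every unknown word still shares one vector, so it does not by itself recover the GloVe vector for a new word.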
# imports are not shown in the original snippet; these paths are assumed
import numpy as np
from keras.preprocessing.text import Tokenizer

# fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

# load GloVe embeddings into a dict
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

# create embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        …
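The snippet above is cut off inside the loop. The sketch below shows one way the loop might finish, assuming the missing line copies the GloVe vector into row `i`. That assignment is a guess, not the author's confirmed code. The small in-memory dictionaries are hypothetical stand-ins for the GloVe file and for `tokenizer.word_index`, so the example runs without any downloads.

```python
import numpy as np

dims = 4  # toy dimensionality; the question uses 100

# Hypothetical stand-in for the dict loaded from the GloVe file.
embeddings_index = {
    "cat": np.ones(dims, dtype="float32"),
    "sat": np.full(dims, 2.0, dtype="float32"),
}

# Hypothetical stand-in for tokenizer.word_index (indices start at 1).
word_index = {"cat": 1, "sat": 2, "zyxx": 3}

# Row 0 is left unused because the word indices start at 1.
embedding_matrix = np.zeros((len(word_index) + 1, dims))
missing = []
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Assumed completion of the truncated loop body.
        embedding_matrix[i] = embedding_vector
    else:
        # No GloVe vector: the row stays all-zeros.
        missing.append(word)

print(embedding_matrix.shape)
print(missing)
```

The matrix only has rows for words the tokenizer saw during fitting, which is why a word that appears for the first time at prediction time has no row to look up, even if GloVe contains a vector for it.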