How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

Question

Mar*_*ace 5 python word2vec keras tensorflow word-embedding

How do I load pre-trained word embeddings into a Keras Embedding layer?

I downloaded glove.6B.50d.txt (from the glove.6B.zip file) from https://nlp.stanford.edu/projects/glove/, but I am not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/

Answer 1

Mar*_*ace 8

You need to pass an embeddingMatrix to the Embedding layer, as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

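vocabLen: the number of tokens in your vocabulary
embDim: the embedding vector dimension (50 in your example)
embeddingMatrix: the embedding matrix built from glove.6B.50d.txt
isTrainable: whether you want the embeddings to be trainable or the layer frozen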
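
As a rough illustration of what these parameters mean, the Embedding layer can be thought of as a row lookup into embeddingMatrix: each integer token index selects one row. The NumPy sketch below mimics only that lookup and does not use Keras; the tiny 4x3 matrix and the index values are made up for illustration and are not GloVe data.

```python
import numpy as np

# Hypothetical toy sizes: 3 real tokens + 1 reserved row for index 0.
vocabLen = 4
embDim = 3

embeddingMatrix = np.zeros((vocabLen, embDim))
embeddingMatrix[1] = [0.1, 0.2, 0.3]
embeddingMatrix[2] = [0.4, 0.5, 0.6]
embeddingMatrix[3] = [0.7, 0.8, 0.9]

# A batch holding one sequence of token indices, padded with 0 at the end.
tokenIndices = np.array([[2, 3, 0]])

# Conceptually, the layer returns the matrix rows selected by the indices.
embedded = embeddingMatrix[tokenIndices]

print(embedded.shape)  # (batch size, sequence length, embDim)
```

Row 0 stays all zeros here, which matches the answer's convention of reserving index 0.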

glove.6B.50d.txt is a list of space-separated values: a word token followed by its (50) embedding values. For example: the 0.418 0.24968 -0.41242 ...
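
A minimal sketch of how one such line can be split into a token and a vector. The sample line reuses the three values quoted above and is truncated to those three values for brevity, so it is not a complete 50-dimensional record.

```python
import numpy as np

# Shortened sample line in the GloVe text format (a real line has 50 values).
line = "the 0.418 0.24968 -0.41242\n"

record = line.strip().split()                     # split on whitespace
token = record[0]                                 # first field is the word
vector = np.array(record[1:], dtype=np.float64)   # remaining fields are floats

print(token, vector.shape)
```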

To create a pretrainedEmbeddingLayer from a Glove file:

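import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

# Prepare Glove File
def readGloveFile(gloveFile):
    with open(gloveFile, 'r', encoding='utf-8') as f: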
        wordToGlove = {}  # map from a token (word) to a Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token 

        for line in f:
            record = line.strip().split()
            token = record[0] # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64) # associate the Glove embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # 0 is reserved for masking in Keras
            wordToIndex[tok] = kerasIdx # associate an index with a token (word)
            indexToWord[kerasIdx] = tok # associate a token (word) with an index. Note: inverse of the dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word] # create embedding: word index to Glove word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...

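
Before the model can use the layer, input text has to be converted into the integer indices stored in wordToIndex. A minimal sketch of that step, assuming lowercase whitespace tokenization, a fixed sequence length, and that unknown words fall back to the reserved index 0. The helper name sentenceToIndices, the maxLen parameter and the toy dictionary are hypothetical and not part of the answer above.

```python
import numpy as np

def sentenceToIndices(sentence, wordToIndex, maxLen):
    """Map a sentence to a fixed-length array of token indices (0 = padding)."""
    indices = np.zeros(maxLen, dtype=np.int64)
    tokens = sentence.lower().split()
    for pos, tok in enumerate(tokens[:maxLen]):  # truncate overly long input
        # Assumption: out-of-vocabulary words map to index 0.
        indices[pos] = wordToIndex.get(tok, 0)
    return indices

# Toy stand-in for the dictionary returned by readGloveFile.
toyWordToIndex = {"cat": 1, "sat": 2, "the": 3}
result = sentenceToIndices("The cat sat down", toyWordToIndex, 6)
print(result)
```

Mapping unknown words to 0 is only one option; a dedicated unknown-token row would be another.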

Damn, things change fast! I think in the latest version you should use `embeddings_initializer=Constant(embeddingMatrix)` (3 upvotes)
That is my impression after checking this (https://keras.io/layers/embeddings/) and this (https://github.com/tensorflow/tensorflow/issues/14392) (2 upvotes)
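
The variant suggested in the first comment can be sketched as follows, assuming the Keras bundled with TensorFlow (tensorflow.keras) is installed. The toy 4x3 matrix stands in for a real GloVe matrix, and whether this form is required depends on the Keras version in use.

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

# Toy stand-in for the GloVe matrix: row 0 is the reserved all-zero row.
embeddingMatrix = np.zeros((4, 3))
embeddingMatrix[1:] = np.arange(9, dtype=np.float64).reshape(3, 3)

vocabLen, embDim = embeddingMatrix.shape

embeddingLayer = Embedding(
    vocabLen,
    embDim,
    embeddings_initializer=Constant(embeddingMatrix),
    trainable=False,  # keep the pre-trained vectors frozen
)

# Calling the layer once builds it, which creates the weights.
output = embeddingLayer(np.array([[1, 2, 0]]))
loadedWeights = embeddingLayer.get_weights()[0]

print(loadedWeights.shape)
```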

Archived:	8 years, 4 months ago
Viewed:	6524 times
Last activity:	5 years, 1 month ago