What embedding layer output_dim is really needed for a dictionary of only 10,000 words?

Question

What embedding layer output_dim is really needed for a dictionary of only 10,000 words?

Ast*_*Ben 4 deep-learning keras tensorflow word-embedding

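I'm training an RNN with a very small set of word features, around 10,000. I plan to start with an embedding layer before adding the RNN, but it's unclear to me what dimensionality is really needed. I know I can try different values (32, 64, etc.), but I'd rather have some intuition going in. For example, if I use a 32-dimensional embedding vector, then only 3 different values per dimension are needed to completely describe the space (3**32 >> 10000).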

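Alternatively, for a space with so few words, is an embedding layer really needed at all, or does it make more sense to go straight from the input layer to the RNN?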
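The counting argument in the question can be checked in a few lines. This is only a sketch of the arithmetic: a grid with 3 levels in each of 32 dimensions has 3**32 points, and the variable names here are illustrative.

```python
import math

vocab_size = 10_000
embedding_dim = 32

# Number of distinct points when each of the 32 dimensions takes one of 3 values
points_with_three_levels = 3 ** embedding_dim

# Smallest number of levels per dimension whose grid still covers the vocabulary
levels_needed = math.ceil(vocab_size ** (1 / embedding_dim))

print(points_with_three_levels)  # 1853020188851841
print(levels_needed)             # 2, because 2**32 = 4294967296 already exceeds 10000
```

Being able to tell the words apart is a much weaker requirement than placing similar words near each other, so this count is only a lower bound on a useful dimension.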

Answer 1

mod*_*itt 6

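This is a good question that doesn't have a good answer. You should definitely use an embedding layer rather than going straight into an LSTM/GRU. However, the latent dimension of the embedding layer should be "as large as possible while maintaining peak validation performance". For a dictionary around your size, 128 or 256 should be a reasonable choice. I doubt you'll see drastically different performance between them.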
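One way to compare the two options is to count trainable weights. The sketch below uses the standard parameter count of a single LSTM layer (four gates, each with input weights, recurrent weights and a bias). The hidden size of 128 is an assumption made for illustration and does not come from the answer.

```python
def lstm_param_count(input_dim, units):
    """Weights of one standard LSTM layer: 4 gates with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

vocab_size = 10_000
units = 128          # assumed hidden size, not taken from the answer
embedding_dim = 128  # one of the sizes suggested in the answer

# One-hot vectors of length vocab_size fed directly into the LSTM
one_hot_total = lstm_param_count(vocab_size, units)

# Embedding table followed by an LSTM that reads embedding_dim-sized inputs
embedded_total = vocab_size * embedding_dim + lstm_param_count(embedding_dim, units)

print(one_hot_total)   # 5186048
print(embedded_total)  # 1411584
```

Under these assumptions the embedding variant has roughly a quarter of the weights, which is consistent with the advice above but is not a substitute for checking validation performance.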

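However, something that will really affect your results on a small dataset is not using pre-trained word embeddings. Without them, your embeddings will overfit badly to your training data. I recommend using GloVe word embeddings. After downloading the GloVe data, you can use the vectors to initialize the weights of your embedding layer, and the embedding layer will then fine-tune the weights for your use case. Here is some code I use for GloVe embeddings with Keras. It lets you load different sizes of them and also caches the matrix so that the second run is quick.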

import os
from enum import Enum

import numpy as np


class GloVeSize(Enum):

    tiny = 50
    small = 100
    medium = 200
    large = 300


__DEFAULT_SIZE = GloVeSize.small


def get_pretrained_embedding_matrix(word_to_index,
                                    vocab_size=10000,
                                    glove_dir="./bin/GloVe",
                                    use_cache_if_present=True,
                                    cache_if_computed=True,
                                    cache_dir='./bin/cache',
                                    size=__DEFAULT_SIZE,
                                    verbose=1):

    """
    get pre-trained word embeddings from GloVe: https://github.com/stanfordnlp/GloVe
    :param word_to_index: a word to index map of the corpus
    :param vocab_size: the vocab size
    :param glove_dir: the dir of glove
    :param use_cache_if_present: whether to use a cached weight file if present
    :param cache_if_computed: whether to cache the result if re-computed
    :param cache_dir: the directory of the project's cache
    :param size: an enumerated choice of GloVeSize
    :param verbose: the verbosity level of logging
    :return: a matrix of the embeddings
    """
    def vprint(*args, with_arrow=True):
        if verbose > 0:
            if with_arrow:
                print(">>", *args)
            else:
                print(*args)

    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_path = os.path.join(cache_dir, 'glove_%d_embedding_matrix.npy' % size.value)
    if use_cache_if_present and os.path.isfile(cache_path):
        return np.load(cache_path)
    else:
        vprint('computing embeddings', with_arrow=True)
        embeddings_index = {}
        size_value = size.value
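        glove_path = os.path.join(glove_dir, 'glove.6B.%dd.txt' % size_value)

        # GloVe files are UTF-8 text; the context manager closes the file for us
        with open(glove_path, encoding="utf-8") as f:
            for line in f:
                # each line is a token followed by its vector components
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
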
        vprint('Found', len(embeddings_index), 'word vectors.')

        embedding_matrix = np.random.normal(size=(vocab_size, size.value))

        non = 0
        for word, index in word_to_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector
            else:
                non += 1

        vprint(non, "words did not have mappings")
        vprint(with_arrow=False)

        if cache_if_computed:
            np.save(cache_path, embedding_matrix)

    return embedding_matrix

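For readers who have not opened a GloVe file: each line of glove.6B.*.txt holds a token followed by its vector components, separated by spaces. The following is a minimal, dependency-free sketch of the parsing step on an in-memory stand-in. The toy vectors, the helper name `parse_glove_lines`, and the sample vocabulary are all made up for illustration.

```python
import io

# Toy stand-in for a GloVe file; the numbers are invented for illustration
toy_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.4 0.5 -0.6\n"
)


def parse_glove_lines(lines):
    """Map each token to its vector, given an iterable of GloVe-format lines."""
    vectors = {}
    for line in lines:
        token, *values = line.split()
        vectors[token] = [float(v) for v in values]
    return vectors


toy_vectors = parse_glove_lines(toy_glove_file)

# Hypothetical corpus vocabulary; "zebra" has no pre-trained vector
toy_word_to_index = {"the": 0, "cat": 1, "zebra": 2}
missing = [w for w in toy_word_to_index if w not in toy_vectors]

print(toy_vectors["cat"])
print(missing)
```

Words that end up in `missing` correspond to the rows that the function above leaves at their random initial values.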

Then instantiate your embedding layer with that weight matrix:

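# self.vocabulary_size, self.input_length, data and input_layer come from
# the surrounding model class, which is not shown here
embedding_size = GloVeSize.small
embedding_matrix = get_pretrained_embedding_matrix(data.word_to_index,
                                                   size=embedding_size)

embedding = Embedding(
    output_dim=embedding_size.value,
    input_dim=self.vocabulary_size + 1,
    input_length=self.input_length,
    mask_zero=True,
    # row 0 is an all-zero vector reserved for the masked padding id,
    # so every pre-trained row moves down by one position
    weights=[np.vstack((np.zeros((1, embedding_size.value)),
                        embedding_matrix))],
    name='embedding'
)(input_layer)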

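Because a zero row is stacked on top of the matrix and mask_zero=True treats id 0 as padding, the token ids fed to the layer have to be shifted by one relative to the matrix rows. The sketch below illustrates that bookkeeping with plain lists. It assumes that word_to_index is zero-based, which is inferred from the stacking and not stated in the answer, and the `encode` helper is hypothetical.

```python
embedding_dim = 3
word_to_index = {"cat": 0, "dog": 1}                   # assumed zero-based
embedding_matrix = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy values

# Prepend an all-zero row so that id 0 is free to mean "padding"
padded_matrix = [[0.0] * embedding_dim] + embedding_matrix


def encode(tokens, length):
    """Shift every word index by one and right-pad the sequence with id 0."""
    ids = [word_to_index[t] + 1 for t in tokens][:length]
    return ids + [0] * (length - len(ids))


ids = encode(["dog", "cat"], length=4)
print(ids)                    # [2, 1, 0, 0]
print(padded_matrix[ids[0]])  # [0.4, 0.5, 0.6], the row for "dog"
```

If the ids are not shifted, the first word in the vocabulary is silently treated as padding and every other word looks up its neighbour's vector.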
