Why is the embedding vector multiplied by a constant in the Transformer model?

gis*_*ang 8 python deep-learning tensorflow attention-model

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.

As the Positional encoding section says:

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

The positional encoding vector is added to the embedding vector.

My understanding is that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is multiplied by a constant first.

The code in the Encoder section is as follows:

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)


    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

We can see that x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) comes before x += self.pos_encoding[:, :seq_len, :].

So why is the embedding vector multiplied by this constant before the positional encoding is added in the Transformer model?

小智 9

Looking around, I found this argument [1]:

The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won't be lost when we add them together.
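To make the magnitude argument concrete, here is a minimal sketch (not part of the tutorial; d_model = 512 and the vocabulary size are just assumed numbers) that compares the typical norms at initialization. It rebuilds the sinusoidal encoding with the same sin/cos formula the tutorial uses, and relies on the fact that tf.keras.layers.Embedding starts from small uniform weights (roughly [-0.05, 0.05] by default):

import numpy as np
import tensorflow as tf

d_model = 512      # assumed model dimension (base Transformer)
vocab_size = 8500  # hypothetical vocabulary size, only for illustration
seq_len = 4

# Sinusoidal positional encoding built with the tutorial's sin/cos formula.
# Each (sin, cos) pair of one angle contributes sin^2 + cos^2 = 1, so every
# position vector has norm sqrt(d_model / 2) = 16.
pos = np.arange(seq_len)[:, np.newaxis]
i = np.arange(d_model)[np.newaxis, :]
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pos_encoding = np.where(i % 2 == 0, np.sin(angles), np.cos(angles)).astype(np.float32)

# A freshly initialized Embedding layer has small weights, so its vectors
# are tiny compared to the positional encoding until they are scaled up.
embedding = tf.keras.layers.Embedding(vocab_size, d_model)
emb = embedding(tf.constant([[1, 2, 3, 4]]))  # (1, seq_len, d_model)
scaled = emb * tf.math.sqrt(tf.cast(d_model, tf.float32))

print("mean norm of embeddings        :", float(tf.reduce_mean(tf.norm(emb, axis=-1))))
print("mean norm of scaled embeddings :", float(tf.reduce_mean(tf.norm(scaled, axis=-1))))
print("mean norm of pos. encodings    :", float(np.linalg.norm(pos_encoding, axis=-1).mean()))

At initialization the scaled embeddings land on a scale comparable to the positional encodings (roughly 15 vs 16 here), while the unscaled embeddings, with norms well below 1, would be almost drowned out by the encoding.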


小智 3

I believe the reason for this scaling has nothing to do with the scaling applied in the attention layers. It is probably there because the Transformer shares the weights of the embedding layer and the output softmax layer, and the scale that suits the embedding is not the same as the scale that suits the fully connected (pre-softmax) layer.

Some implementations of the Transformer use this scaling even though they do not actually share the embedding weights with the output layer; there it is probably kept for consistency (or by mistake). Just make sure the initialization of the embeddings is consistent with the scaling.
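For reference, section 3.4 of Attention Is All You Need does state that the same weight matrix is shared between the embedding layers and the pre-softmax linear transformation, and that in the embedding layers those weights are multiplied by sqrt(d_model). Below is a minimal, hypothetical sketch of such weight tying (the class name TiedEmbedding and the sizes are made up for illustration, not taken from the tutorial):

import tensorflow as tf

class TiedEmbedding(tf.keras.layers.Layer):
  """Hypothetical sketch: one weight matrix serves both as the input
  embedding (scaled by sqrt(d_model)) and as the pre-softmax projection."""

  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)

  def call(self, ids):
    # Input side: look up embeddings and scale them so their magnitude
    # suits the positional encoding that gets added afterwards.
    return self.embedding(ids) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))

  def linear(self, x):
    # Output side: reuse the same (transposed) matrix for the logits.
    return tf.einsum('bld,vd->blv', x, self.embedding.embeddings)

tied = TiedEmbedding(vocab_size=8500, d_model=512)
x = tied(tf.constant([[1, 2, 3]]))  # (1, 3, 512), embeddings already scaled
# In a real model the decoder output would be fed here; this call only
# shows that the shapes work out with a single shared matrix.
logits = tied.linear(x)             # (1, 3, 8500)

The point is just that one matrix serves two roles; the sqrt(d_model) factor lets the embedding side see a larger scale without changing the weights the output projection uses.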