Implementation of the Dense Synthesizer

Question

alv*_*vas 14 python transformer-model neural-network deep-learning pytorch

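I'm trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf). It describes a dense synthesizer mechanism that is meant to replace the traditional attention model described in the Transformer architecture.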

The dense synthesizer is described as follows:
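
For reference, the three equations that the code comments further down refer to appear to be roughly the following. This is a reconstruction based on those comments, not a quote: \sigma_R is assumed to be ReLU, G(X) is assumed to be a projection of the input (the code passes it in as v), and the subscripts on W and b are added here only to tell the two layers apart.

```latex
\begin{align}
B_i  &= F(X_i) \tag{1} \\
F(X) &= W_2\,\sigma_R(W_1 X + b_1) + b_2 \tag{2} \\
Y    &= \operatorname{softmax}(B)\, G(X) \tag{3}
\end{align}
```

With X of shape l x d, B has shape l x l and Y has shape l x d.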

So I tried to implement this layer as shown below, but I'm not sure I got it right:

import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super(DenseSynthesizer, self).__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def forward(self, x, v):
        # Equation (1) and (2)
        # Shape: l x l
        b = self.linear2(F.relu(self.linear1(x)))
        # Equation (3)
        # [l x l] x [l x d] -> [l x d]
        # dim=-1 normalizes each row; calling softmax without dim is deprecated
        return torch.matmul(F.softmax(b, dim=-1), v)

Usage:

l, d = 4, 5

x, v = torch.rand(l, d), torch.rand(l, d)

synthesis = DenseSynthesizer(l, d)
synthesis(x, v)
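
One way to sanity-check the shapes is sketched below. It repeats the layer from above so that it runs on its own, and it adds an `attention_weights` helper that is not part of the original code (a hypothetical name, introduced only to expose the softmax output). It checks shapes and row normalization only, not whether the layer matches the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizer(nn.Module):
    """Same layer as in the question, with an explicit softmax dimension."""

    def __init__(self, l, d):
        super().__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def attention_weights(self, x):
        # Hypothetical helper (not in the original code): returns softmax(B).
        b = self.linear2(F.relu(self.linear1(x)))
        return F.softmax(b, dim=-1)

    def forward(self, x, v):
        # [l x l] x [l x d] -> [l x d]
        return torch.matmul(self.attention_weights(x), v)


torch.manual_seed(0)
l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
synthesis = DenseSynthesizer(l, d)

weights = synthesis.attention_weights(x)
out = synthesis(x, v)

print(weights.shape)         # torch.Size([4, 4])
print(out.shape)             # torch.Size([4, 5])
print(weights.sum(dim=-1))   # every row of the weights sums to (approximately) 1
```

Because each row of the weights is a probability distribution, every output row is a convex combination of the rows of `v`.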

Example:

x and v are tensors:

x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
         [0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
         [0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
         [0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])

v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
         [0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
         [0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
         [0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])

Passing them through the dense synthesizer's forward pass returns:

>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v) 

tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
        [0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
        [0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
        [0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)

Is this implementation, and my understanding of the dense synthesizer, correct?

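Theoretically, how is this different from a multi-layer perceptron that takes two different inputs and uses them at different points in the forward pass?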
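
To make the second question more concrete, here is a small sketch of one property of the layer as written above: row i of the synthesized weights is computed from x[i] alone, so it really is a per-token MLP whose output is then used to mix the rows of v. For contrast, the sketch includes a simplified dot-product attention with identity projections. That simplification, and the names `synthesizer_weights` and `dot_product_weights`, are assumptions made for brevity and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
l, d = 4, 5

# Per-token MLP that synthesizes the mixing logits (mirrors linear1/linear2).
mlp = nn.Sequential(nn.Linear(d, l), nn.ReLU(), nn.Linear(l, l))


def synthesizer_weights(x):
    # Row i of the result depends only on x[i].
    return F.softmax(mlp(x), dim=-1)


def dot_product_weights(x):
    # Simplified self-attention weights: identity Q/K projections (assumption).
    return F.softmax(x @ x.T / d ** 0.5, dim=-1)


x = torch.rand(l, d)
x_perturbed = x.clone()
x_perturbed[3] += 1.0  # change only the last token

with torch.no_grad():
    # Compare the weights of token 0 before and after perturbing token 3.
    syn_same = torch.allclose(synthesizer_weights(x)[0],
                              synthesizer_weights(x_perturbed)[0])
    dot_same = torch.allclose(dot_product_weights(x)[0],
                              dot_product_weights(x_perturbed)[0])

print(syn_same)  # True: token 0's weights ignore the other tokens
print(dot_same)  # False: dot-product weights depend on token pairs
```

In other words, the weights come from a plain MLP applied token by token, and the only cross-token interaction is the final matrix product with v.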