Understanding torch.nn.LayerNorm in NLP

YQ.*_*ang 8 python normalization pytorch

I am trying to understand how torch.nn.LayerNorm works in an NLP model. Suppose the input is a batch of sequences of word embeddings:

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)
print("x: ", embedding)

layer_norm = torch.nn.LayerNorm(dim)
print("y: ", layer_norm(embedding))

# outputs:
"""
x:  tensor([[[ 0.5909,  0.1326,  0.8100,  0.7631],
         [ 0.5831, -1.7923, -0.1453, -0.6882],
         [ 1.1280,  1.6121, -1.2383,  0.2150]],

        [[-0.2128, -0.5246, -0.0511,  0.2798],
         [ 0.8254,  1.2262, -0.0252, -1.9972],
         [-0.6092, -0.4709, -0.8038, -1.2711]]])
y:  tensor([[[ 0.0626, -1.6495,  0.8810,  0.7060],
         [ 1.2621, -1.4789,  0.4216, -0.2048],
         [ 0.6437,  1.0897, -1.5360, -0.1973]],

        [[-0.2950, -1.3698,  0.2621,  1.4027],
         [ 0.6585,  0.9811, -0.0262, -1.6134],
         [ 0.5934,  1.0505, -0.0497, -1.5942]]],
       grad_fn=<NativeLayerNormBackward0>)
"""

Based on the documentation, my understanding was that the mean and standard deviation are computed over all the embedding values of each sample, so I tried to compute y[0, 0, :] manually:

mean = torch.mean(embedding[0, :, :])
std = torch.std(embedding[0, :, :])
print((embedding[0, 0, :] - mean) / std)

This gives tensor([ 0.4310, -0.0319, 0.6523, 0.6050]), which is not the correct output. I would like to know what the correct way to compute y[0, 0, :] is.

B20*_*011 14

The PyTorch LayerNorm documentation states that the mean and standard deviation are computed over the last D dimensions. Based on this, for an input of shape (batch_size, seq_size, embedding_dim) one might expect layer norm to compute statistics over the last 2 dimensions, (seq_size, embedding_dim), excluding the batch dim. Which dimensions D covers is determined by the normalized_shape argument: LayerNorm(dim) normalizes over only the last dimension, while LayerNorm([seq_size, dim]) normalizes over the last two (see the sketch just below).
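A minimal sketch of that difference, with elementwise_affine=False so the learned scale and shift do not obscure the raw normalization (shapes follow the question's code):

import torch

batch_size, seq_size, dim = 2, 3, 4
x = torch.randn(batch_size, seq_size, dim)

# normalized_shape = dim: statistics per token, over the last dimension only.
ln_last = torch.nn.LayerNorm(dim, elementwise_affine=False)

# normalized_shape = [seq_size, dim]: statistics per sample, over the last 2 dims.
ln_last_two = torch.nn.LayerNorm([seq_size, dim], elementwise_affine=False)

print(ln_last(x)[0, 0])      # normalized with mean/var of x[0, 0, :] (4 values)
print(ln_last_two(x)[0, 0])  # normalized with mean/var of x[0, :, :] (12 values)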


A similar question and answer on implementing layer normalization can be found at Layer Normalization in pytorch?


The papers below show how layer norm is applied differently in NLP.


An explanation of Instance, Layer, and Group Norm:




From the Group Norm paper:


Layer Normalization (LN) operates along the channel dimension


LN computes µ and σ along the (C, H, W) axes for each sample.




Examples of different applications:


In the PyTorch documentation's NLP example with a 3D tensor, the mean and standard deviation are computed over only the last dimension, embedding_dim.


This paper describes the same behaviour as the PyTorch doc example:


Almost all NLP tasks take variable-length sequences as input, which fits well with LN, which computes statistics only over the channel dimension without involving the batch and sequence-length dimensions.
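That independence from sequence length is easy to check: with LayerNorm(dim) each token is normalized on its own, so a token's output does not change when more tokens are appended. A small sketch (the shapes here are arbitrary choices for illustration):

import torch

dim = 4
layer_norm = torch.nn.LayerNorm(dim, elementwise_affine=False)

short_seq = torch.randn(1, 3, dim)                                # 3 tokens
long_seq = torch.cat([short_seq, torch.randn(1, 2, dim)], dim=1)  # same 3 tokens + 2 more

# The shared tokens are normalized identically regardless of sequence length.
assert torch.allclose(layer_norm(short_seq), layer_norm(long_seq)[:, :3])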




An example from another paper:


LN normalizes across the channel/feature dimension, as shown in Figure 1.




Manual layer norm over only the embedding dim

import torch

batch_size, seq_size, dim = 2, 3, 4
last_dims = 4

embedding = torch.randn(batch_size, seq_size, dim)
print("x: ", embedding)

layer_norm = torch.nn.LayerNorm(last_dims, elementwise_affine=False)
layer_norm_out = layer_norm(embedding)
print("y: ", layer_norm_out)

eps: float = 0.00001

# Statistics over only the last dimension (the embedding dim) of sample 0.
mean = torch.mean(embedding[0, :, :], dim=-1, keepdim=True)
var = torch.square(embedding[0, :, :] - mean).mean(dim=-1, keepdim=True)
y_custom = (embedding[0, :, :] - mean) / torch.sqrt(var + eps)
print("y_custom: ", y_custom)
assert torch.allclose(layer_norm_out[0], y_custom), 'Tensors do not match.'

# Same computation for sample 1.
mean = torch.mean(embedding[1, :, :], dim=-1, keepdim=True)
var = torch.square(embedding[1, :, :] - mean).mean(dim=-1, keepdim=True)
y_custom = (embedding[1, :, :] - mean) / torch.sqrt(var + eps)
print("y_custom: ", y_custom)
assert torch.allclose(layer_norm_out[1], y_custom), 'Tensors do not match.'

Output

x:  tensor([[[-0.0594, -0.8702, -1.9837,  0.2914],
         [-0.4774,  1.0372,  0.6425, -1.1357],
         [ 0.3872, -0.9190, -0.5774,  0.3281]],

        [[-0.5548,  0.0815,  0.2333,  0.3569],
         [ 1.0380, -0.1756, -0.7417,  2.2930],
         [-0.0075, -0.3623,  1.9310, -0.7043]]])
y:  tensor([[[ 0.6813, -0.2454, -1.5180,  1.0822],
         [-0.5700,  1.1774,  0.7220, -1.3295],
         [ 1.0285, -1.2779, -0.6747,  0.9241]],

        [[-1.6638,  0.1490,  0.5814,  0.9334],
         [ 0.3720, -0.6668, -1.1513,  1.4462],
         [-0.2171, -0.5644,  1.6809, -0.8994]]])
y_custom:  tensor([[ 0.6813, -0.2454, -1.5180,  1.0822],
        [-0.5700,  1.1774,  0.7220, -1.3295],
        [ 1.0285, -1.2779, -0.6747,  0.9241]])
y_custom:  tensor([[-1.6638,  0.1490,  0.5814,  0.9334],
        [ 0.3720, -0.6668, -1.1513,  1.4462],
        [-0.2171, -0.5644,  1.6809, -0.8994]])

Manual layer norm on a 4D tensor

import torch

batch_size, c, h, w = 2, 3, 2, 4
last_dims = [c, h, w]

embedding = torch.randn(batch_size, c, h, w)
print("x: ", embedding)

layer_norm = torch.nn.LayerNorm(last_dims, elementwise_affine=False)
layer_norm_out = layer_norm(embedding)
print("y: ", layer_norm_out)

eps: float = 0.00001

# Statistics over all of (C, H, W) for sample 0, matching the Group Norm
# paper's description of LN.
mean = torch.mean(embedding[0, :, :], dim=(-3, -2, -1), keepdim=True)
var = torch.square(embedding[0, :, :] - mean).mean(dim=(-3, -2, -1), keepdim=True)
y_custom = (embedding[0, :, :] - mean) / torch.sqrt(var + eps)
print("y_custom: ", y_custom)
assert torch.allclose(layer_norm_out[0], y_custom), 'Tensors do not match.'

# Same computation for sample 1.
mean = torch.mean(embedding[1, :, :], dim=(-3, -2, -1), keepdim=True)
var = torch.square(embedding[1, :, :] - mean).mean(dim=(-3, -2, -1), keepdim=True)
y_custom = (embedding[1, :, :] - mean) / torch.sqrt(var + eps)
print("y_custom: ", y_custom)
assert torch.allclose(layer_norm_out[1], y_custom), 'Tensors do not match.'

Output

x:  tensor([[[[ 1.0902, -0.8648,  1.5785,  0.3087],
          [ 0.0249, -1.3477, -0.9565, -1.5024]],

         [[ 1.8024, -0.2894,  0.7284,  0.7822],
          [ 1.4385, -0.2848, -0.3114,  0.4633]],

         [[ 0.9061,  0.3066,  0.9916,  0.9284],
          [ 0.3356,  0.9162, -0.4579,  1.0669]]],


        [[[-0.8292,  0.9111, -0.7307, -1.1003],
          [ 0.3441, -1.9823,  0.1313,  0.2048]],

         [[-0.2838,  0.1147, -0.1605, -0.4637],
          [-2.1343, -0.4402,  1.6685,  0.4455]],

         [[ 0.6895, -2.7331,  1.1693, -0.6999],
          [-0.3497, -0.2942, -0.0028, -1.3541]]]])
y:  tensor([[[[ 0.8653, -1.3279,  1.4131, -0.0114],
          [-0.3298, -1.8697, -1.4309, -2.0433]],

         [[ 1.6643, -0.6824,  0.4594,  0.5198],
          [ 1.2560, -0.6772, -0.7071,  0.1619]],

         [[ 0.6587, -0.0137,  0.7547,  0.6838],
          [ 0.0188,  0.6701, -0.8715,  0.8392]]],


        [[[-0.4938,  1.2220, -0.3967, -0.7610],
          [ 0.6629, -1.6306,  0.4531,  0.5256]],

         [[ 0.0439,  0.4368,  0.1655, -0.1335],
          [-1.7805, -0.1103,  1.9686,  0.7629]],

         [[ 1.0035, -2.3707,  1.4764, -0.3663],
          [-0.0211,  0.0337,  0.3210, -1.0112]]]])
y_custom:  tensor([[[ 0.8653, -1.3279,  1.4131, -0.0114],
         [-0.3298, -1.8697, -1.4309, -2.0433]],

        [[ 1.6643, -0.6824,  0.4594,  0.5198],
         [ 1.2560, -0.6772, -0.7071,  0.1619]],

        [[ 0.6587, -0.0137,  0.7547,  0.6838],
         [ 0.0188,  0.6701, -0.8715,  0.8392]]])
y_custom:  tensor([[[-0.4938,  1.2220, -0.3967, -0.7610],
         [ 0.6629, -1.6306,  0.4531,  0.5256]],

        [[ 0.0439,  0.4368,  0.1655, -0.1335],
         [-1.7805, -0.1103,  1.9686,  0.7629]],

        [[ 1.0035, -2.3707,  1.4764, -0.3663],
         [-0.0211,  0.0337,  0.3210, -1.0112]]])
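To connect this back to the Group Norm quote above: GroupNorm with a single group also normalizes each sample over all of (C, H, W), so with affine turned off it should match LayerNorm([c, h, w]). A quick sketch, assuming the same shapes as above:

import torch

batch_size, c, h, w = 2, 3, 2, 4
x = torch.randn(batch_size, c, h, w)

ln = torch.nn.LayerNorm([c, h, w], elementwise_affine=False)
gn = torch.nn.GroupNorm(num_groups=1, num_channels=c, affine=False)

# Both normalize each sample over all of (C, H, W).
assert torch.allclose(ln(x), gn(x), atol=1e-6)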

Example of a custom layer norm implementation

from typing import List, Union

import torch


batch_size, seq_size, embed_dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, embed_dim)
print("x: ", embedding)
print(embedding.shape)
print()


layer_norm = torch.nn.LayerNorm(embed_dim, elementwise_affine=False)
layer_norm_output = layer_norm(embedding)
print("y: ", layer_norm_output)
print(layer_norm_output.shape)
print()


def custom_layer_norm(
        x: torch.Tensor, dim: Union[int, List[int]] = -1, eps: float = 0.00001
) -> torch.Tensor:
    # Accept a single dim or a list of dims to reduce over.
    dims = tuple(dim) if isinstance(dim, (list, tuple)) else (dim,)
    mean = torch.mean(x, dim=dims, keepdim=True)
    var = torch.square(x - mean).mean(dim=dims, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)


custom_layer_norm_output = custom_layer_norm(embedding)
print("y_custom: ", custom_layer_norm_output)
print(custom_layer_norm_output.shape)

assert torch.allclose(layer_norm_output, custom_layer_norm_output), 'Tensors do not match.'

Output

x:  tensor([[[-0.4808, -0.1981,  0.4538, -1.2653],
         [ 0.3578,  0.6592,  0.2161,  0.3852],
         [ 1.2184, -0.4238, -0.3415, -0.3487]],

        [[ 0.9874, -1.7737,  0.1886,  0.0448],
         [-0.5162,  0.7872, -0.3433, -0.3266],
         [-0.5459, -0.0371,  1.2625, -1.6030]]])
torch.Size([2, 3, 4])

y:  tensor([[[-0.1755,  0.2829,  1.3397, -1.4471],
         [-0.2916,  1.5871, -1.1747, -0.1208],
         [ 1.7301, -0.6528, -0.5334, -0.5439]],

        [[ 1.1142, -1.6189,  0.3235,  0.1812],
         [-0.8048,  1.7141, -0.4709, -0.4384],
         [-0.3057,  0.1880,  1.4489, -1.3312]]])
torch.Size([2, 3, 4])

y_custom:  tensor([[[-0.1755,  0.2829,  1.3397, -1.4471],
         [-0.2916,  1.5871, -1.1747, -0.1208],
         [ 1.7301, -0.6528, -0.5334, -0.5439]],

        [[ 1.1142, -1.6189,  0.3235,  0.1812],
         [-0.8048,  1.7141, -0.4709, -0.4384],
         [-0.3057,  0.1880,  1.4489, -1.3312]]])
torch.Size([2, 3, 4])
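The same computation is also available as a function, torch.nn.functional.layer_norm, which removes the need for a custom implementation; a minimal sketch:

import torch
import torch.nn.functional as F

embedding = torch.randn(2, 3, 4)

# normalized_shape=(4,) -> statistics over only the last dimension.
out = F.layer_norm(embedding, normalized_shape=(4,))

layer_norm = torch.nn.LayerNorm(4, elementwise_affine=False)
assert torch.allclose(out, layer_norm(embedding))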

  • I found my problem: instead of computing `std = torch.std(embedding[0, :, :])`, I should compute the biased standard deviation over only the last dimension, e.g. `std = torch.sqrt(torch.var(embedding[0, 0, :], unbiased=False))` for `y[0, 0, :]`. (2 upvotes)
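A quick way to verify that comment against the question's setup (a sketch; the tensor is random, so printed values differ from run to run):

import torch

batch_size, seq_size, dim = 2, 3, 4
embedding = torch.randn(batch_size, seq_size, dim)

layer_norm = torch.nn.LayerNorm(dim, elementwise_affine=False)
y = layer_norm(embedding)

# The biased std (unbiased=False) over only the last dimension reproduces
# LayerNorm, up to the small eps added inside the layer.
mean = torch.mean(embedding[0, 0, :])
std = torch.sqrt(torch.var(embedding[0, 0, :], unbiased=False))
print((embedding[0, 0, :] - mean) / std)
print(y[0, 0, :])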