Understanding why memory is allocated during inference, backpropagation, and the model update

Amb*_*ose 6 gpu pytorch

While tracking down a GPU OOM error, I added the following checkpoints to my PyTorch code (running on a Google Colab P100):

learning_rate = 0.001
num_epochs = 50

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('check 1')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = MyModel()

print('check 2')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = model.to(device)

print('check 3')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

print('check 4')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0

    model = model.train()

    print('check 5')
    !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1,1,100,1632).to(device)

        print('check 6')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## forward + backprop + loss
        output = model(input, comb)

        print('check 7')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss = my_loss(output, output_array)

        print('check 8')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        optimizer.zero_grad()

        print('check 9')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss.backward()

        print('check 10')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## update model params
        optimizer.step()

        print('check 11')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_running_loss += loss.detach().item()

        print('check 12')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        temp = get_accuracy(output, output_array)

        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_accuracy += temp     

The output is as follows:

check 1
2MiB/16160MiB
check 2
2MiB/16160MiB
check 3
3769MiB/16160MiB
check 4
3769MiB/16160MiB
check 5
3769MiB/16160MiB
check 6
3847MiB/16160MiB
check 7
6725MiB/16160MiB
check 8
6725MiB/16160MiB
check 9
6725MiB/16160MiB
check 10
9761MiB/16160MiB
check 11
16053MiB/16160MiB
check 12
16053MiB/16160MiB
check 13
16053MiB/16160MiB
check 6
16053MiB/16160MiB
check 7
16071MiB/16160MiB
check 8
16071MiB/16160MiB
check 9
16071MiB/16160MiB
check 10
16071MiB/16160MiB
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-f566d09448f9> in <module>()
     65 
     66         ## update model params
---> 67         optimizer.step()
     68 
     69         print('check 11')

3 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86                 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87                 with torch.autograd.profiler.record_function(profile_name):
---> 88                     return func(*args, **kwargs)
     89             return wrapper
     90 

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
    116                    lr=group['lr'],
    117                    weight_decay=group['weight_decay'],
--> 118                    eps=group['eps'])
    119         return loss

/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     92             denom = (max_exp_avg_sqs[i].sqrt() / math.sqrt(bias_correction2)).add_(eps)
     93         else:
---> 94             denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
     95 
     96         step_size = lr / bias_correction1

RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 15.78 GiB total capacity; 11.91 GiB already allocated; 182.75 MiB free; 14.26 GiB reserved in total by PyTorch)

It makes sense to me that model = model.to(device) allocates 3.7 GB of memory.

But why does running the model, output = model(input, comb), allocate another 3 GB of memory?
And why does loss.backward() then allocate another 3 GB?
And optimizer.step() another 6.3 GB?

I would be grateful if someone could explain how the PyTorch GPU memory allocation model works in this example.

Iva*_*van 7

  • Inference

    By default, inference of the model allocates memory to store the activations of each layer (activations in the sense of intermediate layer inputs). These are needed for backpropagation, where those tensors are used to compute the gradients. A simple but effective example is the function f: x -> x². Here df/dx = 2x, so in order to compute df/dx you are required to keep x in memory.

    If you use the torch.no_grad() context manager, you allow PyTorch to not save those values, thus saving memory. This is particularly useful when evaluating or testing the model, i.e. when backpropagation is not performed. Of course, you won't be able to use it during training! (See the first sketch after this list.)

  • Backward propagation

    The backward call allocates additional memory on the device to store each parameter's gradient value. Only leaf tensor nodes (the model parameters and the inputs) get their gradient stored in the grad attribute; this is why memory usage only increases between the inference and backward calls. (See the second sketch after this list.)

  • Model parameter update

    Since you are using a stateful optimizer (Adam), some additional memory is required to store its running estimates for each parameter; see the related PyTorch forum posts on this. If you switch to a stateless optimizer (for instance SGD), there should be no memory overhead on the step call. (See the third sketch after this list.)

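Below are three minimal sketches illustrating the three points above. They are not the asker's code: MyModel is replaced by small stand-in modules, and helper names such as allocated_mb and state_tensors are made up for the illustration.

The first sketch compares the memory held after a forward pass with and without torch.no_grad(), using torch.cuda.memory_allocated() (which reports tensors currently held by PyTorch, whereas nvidia-smi also shows the larger reserved/cached pool). It assumes a CUDA device, as in the question.

import torch
import torch.nn as nn

device = torch.device("cuda:0")  # assumes a CUDA device, as in the question

# Stand-in model: a few wide Linear layers so the activations are noticeable.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).to(device)
x = torch.randn(256, 4096, device=device)

def allocated_mb():
    # Tensors currently allocated by PyTorch (nvidia-smi shows the larger
    # reserved pool of the caching allocator on top of this).
    return torch.cuda.memory_allocated(device) / 2**20

base = allocated_mb()

# Regular forward pass: each layer's input is kept alive for the backward pass.
out = model(x)
print(f"with autograd:  +{allocated_mb() - base:.0f} MiB")
del out

# Inference-only forward pass: intermediate activations are freed immediately.
with torch.no_grad():
    out = model(x)
print(f"with no_grad(): +{allocated_mb() - base:.0f} MiB")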
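The second sketch uses the f: x -> x² example from the answer to show where the extra memory at the backward step comes from: .grad buffers are materialized for leaf tensors (and for every model parameter) only when backward() is called.

import torch

# f: x -> x**2, so df/dx = 2x: autograd has to keep x around to compute it.
x = torch.randn(3, requires_grad=True)   # leaf tensor
y = x ** 2                               # intermediate (non-leaf) tensor
loss = y.sum()

print(x.is_leaf, y.is_leaf)   # True False
print(x.grad)                 # None: no gradient storage allocated yet

loss.backward()

print(x.grad)                 # 2*x: a new tensor allocated during backward
print(y.grad)                 # None (with a warning): intermediates don't keep .grad

# The same applies to model parameters: each one gets a gradient tensor of
# its own size, which is the extra memory seen at the backward call.
lin = torch.nn.Linear(4, 2)
out = lin(torch.randn(1, 4)).sum()
print(lin.weight.grad is None)   # True
out.backward()
print(lin.weight.grad.shape)     # torch.Size([2, 4])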
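The third sketch counts the per-parameter state kept by Adam versus SGD (state_tensors is a made-up helper, and a small nn.Linear stands in for the model). Adam lazily allocates its running estimates (exp_avg and exp_avg_sq, roughly two extra copies of all parameters) on the first step() call, which is why the biggest jump in the question appears at optimizer.step(); SGD without momentum keeps no per-parameter state.

import torch
import torch.nn as nn

def state_tensors(optimizer):
    # Total number of elements in every tensor the optimizer keeps as state.
    return sum(v.numel()
               for state in optimizer.state.values()
               for v in state.values()
               if torch.is_tensor(v))

model = nn.Linear(1000, 1000)
n_params = sum(p.numel() for p in model.parameters())

for opt_cls in (torch.optim.Adam, torch.optim.SGD):
    opt = opt_cls(model.parameters(), lr=1e-3)
    loss = model(torch.randn(8, 1000)).sum()
    opt.zero_grad()
    loss.backward()
    print(f"{opt_cls.__name__} state before step: {state_tensors(opt)} elements")
    opt.step()   # Adam allocates exp_avg and exp_avg_sq here
    print(f"{opt_cls.__name__} state after step:  {state_tensors(opt)} elements "
          f"(model has {n_params} parameters)")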

All three steps can have memory needs. In summary, the memory allocated on your device effectively depends on three elements:

  1. The size of your neural network: the bigger the model, the more layer activations and gradients are kept in memory.

  2. Whether you are inside a torch.no_grad context: in that case, only the model's state needs to be kept in memory (no activations or gradients are necessary).

  3. The type of optimizer used: whether it is stateful (keeps running estimates of some quantities during the parameter update) or stateless (doesn't need to).
