While tracking down a GPU OOM error, I placed the following checkpoints in my PyTorch code (running on a Google Colab P100):
learning_rate = 0.001
num_epochs = 50
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('check 1')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = MyModel()
print('check 2')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = model.to(device)
print('check 3')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print('check 4')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0
    model = model.train()
    print('check 5')
    !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1, 1, 100, 1632).to(device)
        print('check 6')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## forward + backprop + loss
        output = model(input, comb)
        print('check 7')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss = my_loss(output, output_array)
        print('check 8')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        optimizer.zero_grad()
        print('check 9')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss.backward()
        print('check 10')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## update model params
        optimizer.step()
        print('check 11')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_running_loss += loss.detach().item()
        print('check 12')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        temp = get_accuracy(output, output_array)
        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_accuracy += temp
The output looks like this:
check 1
2MiB/16160MiB
check 2
2MiB/16160MiB
check 3
3769MiB/16160MiB
check 4
3769MiB/16160MiB
check 5
3769MiB/16160MiB
check 6
3847MiB/16160MiB
check 7
6725MiB/16160MiB
check 8
6725MiB/16160MiB
check 9
6725MiB/16160MiB
check 10
9761MiB/16160MiB
check 11
16053MiB/16160MiB
check 12
16053MiB/16160MiB
check 13
16053MiB/16160MiB
check 6
16053MiB/16160MiB
check 7
16071MiB/16160MiB
check 8
16071MiB/16160MiB
check 9
16071MiB/16160MiB
check 10
16071MiB/16160MiB
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-f566d09448f9> in <module>()
65
66 ## update model params
---> 67 optimizer.step()
68
69 print('check 11')
3 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
86 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
87 with torch.autograd.profiler.record_function(profile_name):
---> 88 return func(*args, **kwargs)
89 return wrapper
90
/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
26 def decorate_context(*args, **kwargs):
27 with self.__class__():
---> 28 return func(*args, **kwargs)
29 return cast(F, decorate_context)
30
/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
116 lr=group['lr'],
117 weight_decay=group['weight_decay'],
--> 118 eps=group['eps'])
119 return loss
/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
92 denom = (max_exp_avg_sqs[i].sqrt() / math.sqrt(bias_correction2)).add_(eps)
93 else:
---> 94 denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
95
96 step_size = lr / bias_correction1
RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 15.78 GiB total capacity; 11.91 GiB already allocated; 182.75 MiB free; 14.26 GiB reserved in total by PyTorch)
It makes sense to me that model = model.to(device) allocates about 3.7 GB.
But why does running the model, output = model(input, comb), allocate another ~3 GB?
And why does loss.backward() then allocate another ~3 GB?
And optimizer.step() a further ~6.3 GB?
I would appreciate it if someone could explain how PyTorch's GPU memory allocation works in this example.
Inference

By default, a forward pass of the model allocates memory to store the activations of each layer (activations here meaning the intermediate layer inputs). These are needed for backpropagation, where they are used to compute the gradients. A simple but telling example is the function defined by f: x -> x². Here, df/dx = 2x, i.e. in order to compute df/dx you need to keep x in memory.
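This bookkeeping can be sketched in plain Python (no PyTorch needed); `Square` is a hypothetical stand-in for an autograd node that saves its input for the backward pass, much like `ctx.save_for_backward` does in a custom `torch.autograd.Function`:

```python
class Square:
    """Toy autograd-style node for f(x) = x**2."""

    def forward(self, x):
        # The input ("activation") must stay alive until backward runs,
        # because df/dx = 2x depends on x. This is the memory that the
        # forward pass accumulates for every layer of a network.
        self.saved_x = x
        return x * x

    def backward(self, grad_out=1.0):
        # Chain rule: dL/dx = dL/df * df/dx = grad_out * 2x
        return grad_out * 2 * self.saved_x


node = Square()
y = node.forward(3.0)   # y = 9.0; x = 3.0 is now held in memory
dx = node.backward()    # dx = 6.0, computed from the saved x
```

Every layer in a real model does the equivalent of `self.saved_x = x`, which is why memory jumps between check 6 and check 7.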
If you use the torch.no_grad() context manager, you allow PyTorch not to save these values, thereby saving memory. This is particularly useful when evaluating or testing your model, i.e. when no backward pass is performed. Of course, you won't be able to use it during training!
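The effect of such a context can be mimicked in plain Python (class and attribute names here are hypothetical, not PyTorch's API): under a no_grad-style switch the activation is simply never saved, so it can be freed immediately:

```python
class NoGradSquare:
    """Toy f(x) = x**2 node with a no_grad-style switch."""

    grad_enabled = True  # loosely mimics torch.is_grad_enabled()

    def forward(self, x):
        # With gradients disabled, skip saving the activation entirely --
        # this is where torch.no_grad() saves memory.
        self.saved_x = x if NoGradSquare.grad_enabled else None
        return x * x


NoGradSquare.grad_enabled = False   # like entering `with torch.no_grad():`
node = NoGradSquare()
node.forward(4.0)
assert node.saved_x is None         # nothing retained -> no activation memory
```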
Backward pass

The backward pass call allocates additional memory on the device to store each parameter's gradient value. Only the leaf tensor nodes (model parameters and inputs) get their gradient stored in the grad attribute. This is why memory usage increases again between the inference and the backward call.
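As a rough back-of-the-envelope sketch (plain Python, figures illustrative): the backward pass adds one extra float per trainable parameter for the `.grad` buffers, so it roughly doubles the memory held by the weights themselves:

```python
def grad_memory_mib(n_params, bytes_per_elem=4):
    """Extra memory backward() allocates: one fp32 gradient per parameter."""
    return n_params * bytes_per_elem / 2**20


# e.g. a hypothetical model with 750M fp32 parameters (~2.9 GiB of weights)
# needs roughly the same amount again for its .grad buffers:
print(round(grad_memory_mib(750_000_000)))  # ≈ 2861 MiB
```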
Model parameter update

Since you are using a stateful optimizer (Adam), some additional memory is required to save its state. Read the related PyTorch forum posts on this. If you try a stateless optimizer (such as SGD), there should be no memory overhead on the step call.
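Concretely, Adam keeps two running estimates per parameter (exp_avg and exp_avg_sq, visible in torch/optim/_functional.py in the traceback above), so its first step allocates roughly two extra copies of the parameter memory; plain SGD keeps none. An illustrative sketch in plain Python:

```python
# Per-parameter state buffers kept by each optimizer (illustrative):
STATE_SLOTS = {
    "sgd": 0,            # stateless: just applies the gradient
    "sgd_momentum": 1,   # one momentum buffer per parameter
    "adam": 2,           # exp_avg + exp_avg_sq per parameter
}


def optimizer_state_mib(n_params, optimizer, bytes_per_elem=4):
    """Extra fp32 memory the optimizer state occupies after the first step."""
    return STATE_SLOTS[optimizer] * n_params * bytes_per_elem / 2**20


print(round(optimizer_state_mib(750_000_000, "adam")))  # ≈ 5722 MiB
print(round(optimizer_state_mib(750_000_000, "sgd")))   # 0 MiB
```

This is consistent with the jump at check 11 being roughly twice the jump at check 10 in your log.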
All three steps can have memory requirements. In summary, the memory allocated on your device effectively depends on three elements:
The size of your neural network: the bigger the model, the more layer activations and gradients will be kept in memory.
Whether you are inside a torch.no_grad context: in that case only the model's weights need to live in memory (no activations or gradients).
The type of optimizer used: whether it is stateful (keeps running estimates during the parameter update) or stateless (no such overhead).
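Putting the three elements together, a crude estimator of peak training memory (plain Python; names and numbers illustrative, ignoring the CUDA context, allocator caching, and temporaries) might look like:

```python
def peak_training_mib(param_mib, activation_mib,
                      use_no_grad=False, optimizer_slots=2):
    """Rough peak memory: weights + activations + grads + optimizer state."""
    total = param_mib                        # model weights
    if use_no_grad:                          # inference under torch.no_grad()
        return total                         # no activations, grads, or state
    total += activation_mib                  # activations saved for backward
    total += param_mib                       # .grad buffers
    total += optimizer_slots * param_mib     # Adam: exp_avg + exp_avg_sq
    return total


print(peak_training_mib(2900, 3000))                    # training with Adam
print(peak_training_mib(2900, 3000, use_no_grad=True))  # evaluation only
```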