Tags: python, gpu, deep-learning, pytorch
The problem I have is that when I run my model on Google Colab, I frequently get this error:
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
It doesn't happen every time. Sometimes it works fine, other times it doesn't.
(Whether it works seems to be related to the batch size; see the edit at the bottom.)
When running on the CPU only, there is no problem, so it seems to be GPU/CUDA related.
The traceback shows that the error occurs at some point during backward():
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-c8bbb05fb58a> in <module>()
65 out = model(chars, pos)
66 loss = F.binary_cross_entropy_with_logits(out, labels)
---> 67 loss.backward()
1 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
164 products. Defaults to ``False``.
165 """"""
--> 166 torch.autograd.backward(self, gradient, retain_graph, create_graph)
167
168 def register_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Once this error has occurred, I can no longer do anything GPU-related in that runtime. Whenever I try to create a tensor on the GPU or similar, I get a slightly different error message:
RuntimeError: CUDA error: an illegal memory access was encountered
I need to restart the runtime/notebook before I can do anything on the GPU again (only tried PyTorch so far, no other frameworks).
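For illustration, even a trivial allocation fails at that point (minimal sketch; it assumes the crash above has already happened in the current runtime):

import torch

# After the illegal-memory-access crash, the CUDA context of the whole
# process is broken, so even this minimal allocation raises the error above:
try:
    torch.zeros(1, device='cuda')
except RuntimeError as e:
    print(e)  # "CUDA error: an illegal memory access was encountered"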
Here is a code snippet that should reproduce the issue in Google Colab:
import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__, torch.cuda.get_device_name(0))

from torch import nn
from torch.nn import functional as F


class MyModel(nn.Module):
    def __init__(self, char_vocab, num_pos, dim, hidden, dropout):
        super().__init__()
        self.emb_char = nn.Embedding(char_vocab, dim)
        self.cnn1 = nn.Conv1d(dim, dim // 2, kernel_size=3)
        self.bn1 = nn.BatchNorm1d(dim // 2)
        self.cnn2 = nn.Conv1d(dim // 2, dim // 4, kernel_size=2)
        self.bn2 = nn.BatchNorm1d(dim // 4)
        self.pooling = nn.MaxPool1d(2)
        self.lin1 = nn.Linear(11, hidden)
        self.bn3 = nn.BatchNorm1d(hidden)
        self.lin2 = nn.Linear(hidden, dim // 4)
        self.bn4 = nn.BatchNorm1d(dim // 4)
        self.out = nn.Linear(dim // 4, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, chars, pos):
        # character branch: embedding followed by two conv/pool blocks
        x = self.emb_char(chars).transpose(1, 2)
        x = self.drop(x)
        x = self.cnn1(x)
        x = self.bn1(x)
        x = self.pooling(x)
        x = F.relu(x)
        x = self.cnn2(x)
        x = self.bn2(x)
        x = self.pooling(x).squeeze(-1)
        x = F.relu(x)
        x = self.drop(x)
        # position branch: two fully connected blocks
        y = F.relu(self.lin1(pos))
        y = self.bn3(y)
        y = self.drop(y)
        y = F.relu(self.lin2(y))
        y = self.bn4(y)
        y = self.drop(y)
        return self.out(x + y)


model = MyModel(char_vocab=80, num_pos=11, dim=32, hidden=8, dropout=0.5).to(device)

batch_size = 160000

# inputs
chars = torch.randint(0, 79, (batch_size, 8), device=device)
pos = torch.rand(batch_size, 11, device=device)

# labels
labels = torch.ones(batch_size, 1, dtype=torch.float, device=device)

# forward
out = model(chars, pos)
loss = F.binary_cross_entropy_with_logits(out, labels)

# backward
loss.backward()
The PyTorch version is 1.3.1 and the GPU it runs on is a P100-PCIE-16GB.
Any ideas how to get rid of this error?
Edit:
I have added a code snippet with generated inputs that reproduces the problem.
It seems to be related to the large batch size I use. The batch size of 160000 used in the code seems to trigger the error reliably, but I noticed that it does not depend on the batch size alone: I tried reducing it to 80000 and still got the error, yet I know from earlier runs that batch sizes as high as 180000 worked without any problem. Also, after starting with a working batch size of 60000, I can run again in the same runtime with an increased batch size of 140000 without an error (a size that had failed before).
Note that the memory usage is quite low, only around 1-2 GB.
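In case the single huge batch is the trigger, a fallback I could try is splitting it into smaller chunks and accumulating the gradients. A minimal, untested sketch, reusing the names from the snippet above (the chunk size of 20000 is an arbitrary guess, not a known-safe threshold, and the BatchNorm statistics would then be computed per chunk instead of over the full batch):

chunk = 20000  # arbitrary guess, not a known-safe threshold
model.zero_grad()
for i in range(0, batch_size, chunk):
    out = model(chars[i:i + chunk], pos[i:i + chunk])
    loss = F.binary_cross_entropy_with_logits(out, labels[i:i + chunk])
    # scale each chunk so the accumulated gradients match the full-batch mean loss
    (loss * chars[i:i + chunk].size(0) / batch_size).backward()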
Setting CUDA_LAUNCH_BLOCKING=1 leads to a different error message:
RuntimeError: fractional_max_pool2d_backward_out_cuda failed with error code 0
but at the same point in the code and with the same traceback.
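For reference, this is how I enable it: the variable only takes effect if it is set before the first CUDA call of the process, hence the placement before the torch import (in a fresh runtime):

import os
# must be set before anything touches CUDA, i.e. before the first CUDA call
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
# ... rest of the reproduction snippet from above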