PyTorch - RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress

blu*_*nox 7 python gpu deep-learning pytorch

The problem I have is that when I run my model on Google Colab, I frequently get this error:

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

It does not happen every time. Sometimes it works just fine, sometimes it does not.
(Whether it works seems to be related to the batch size, see the edit at the bottom.)

Running on CPU only works without problems, so it seems to be GPU/CUDA related.

The traceback shows that the error occurs at some point during backward():

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-c8bbb05fb58a> in <module>()
     65 out   = model(chars, pos)
     66 loss  = F.binary_cross_entropy_with_logits(out, labels)
---> 67 loss.backward()

1 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    164                 products. Defaults to ``False``.
    165         """
--> 166         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    167 
    168     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Once this error has occurred, I cannot do anything GPU related in that runtime anymore. Whenever I try to create a tensor on the GPU or the like, I get a slightly different error message:

RuntimeError: CUDA error: an illegal memory access was encountered

I have to restart the runtime/notebook to be able to do anything on the GPU again (I have only tried PyTorch, no other frameworks).

Here is a code snippet that should be able to reproduce the problem in Google Colab:

import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__, torch.cuda.get_device_name(0))
from torch import nn
from torch.nn import functional as F
class MyModel(nn.Module):
    def __init__(self, char_vocab, num_pos, dim, hidden, dropout):
        super().__init__()
        self.emb_char  = nn.Embedding(char_vocab, dim)
        self.cnn1 = nn.Conv1d(dim, dim // 2, kernel_size=3)
        self.bn1  = nn.BatchNorm1d(dim // 2)
        self.cnn2 = nn.Conv1d(dim // 2, dim // 4, kernel_size=2)
        self.bn2  = nn.BatchNorm1d(dim // 4)
        self.pooling = nn.MaxPool1d(2)
        self.lin1 = nn.Linear(11, hidden)
        self.bn3  = nn.BatchNorm1d(hidden)
        self.lin2 = nn.Linear(hidden, dim // 4)
        self.bn4  = nn.BatchNorm1d(dim // 4)
        self.out  = nn.Linear(dim // 4, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, chars, pos):
        x = self.emb_char(chars).transpose(1, 2)
        x = self.drop(x)

        x = self.cnn1(x)
        x = self.bn1(x)
        x = self.pooling(x)
        x = F.relu(x)

        x = self.cnn2(x)
        x = self.bn2(x)
        x = self.pooling(x).squeeze(-1)
        x = F.relu(x)
        x = self.drop(x)
        y = F.relu(self.lin1(pos))
        y = self.bn3(y)
        y = self.drop(y)
        y = F.relu(self.lin2(y))
        y = self.bn4(y)
        y = self.drop(y)

        return self.out(x+y)

model = MyModel(char_vocab=80, num_pos=11, dim=32, hidden=8, dropout=0.5).to(device)

batch_size = 160000

# inputs
chars = torch.randint(0, 79, (batch_size, 8), device=device)
pos   = torch.rand(batch_size, 11, device=device)

# labels
labels = torch.ones(batch_size, 1, dtype=torch.float, device=device)

# forward
out   = model(chars, pos)
loss  = F.binary_cross_entropy_with_logits(out, labels)

# backward
loss.backward()

The PyTorch version is 1.3.1 and the GPU it runs on is a P100-PCIE-16GB.

Any ideas how to get rid of this error?


Edit:

  • I added a code snippet with generated inputs that can reproduce the problem.

  • It seems to be related to the high batch size I am using. The batch size of 160000 used in the snippet seems to trigger the error reliably, but I have noticed that it does not depend on the batch size alone. I tried reducing it to 80000 and still got the error. Yet in earlier runs I used an even higher batch size of 180000 without any problems. Also, after starting with a working batch size of 60000, I could then, in the same runtime, run again with an increased batch size of 140000 (which had failed before) without errors.
    Note that memory usage is very low, only about 1-2 GB.

  • Setting CUDA_LAUNCH_BLOCKING=1 leads to a different error message:

    RuntimeError: fractional_max_pool2d_backward_out_cuda failed with error code 0
    

    but at the same point in the code and with the same traceback. (See the sketch below for how the flag has to be set to take effect.)