PyTorch Model Training: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Question

Ath*_*dom 8 python reinforcement-learning lstm pytorch dqn

After training a PyTorch model on a GPU for several hours, the program fails with the error

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Training Conditions

Neural Network: PyTorch 4-layer nn.LSTM with nn.Linear output
Deep Q Network Agent (Vanilla DQN with Replay Memory)
state passed into forward() has the shape (32, 20, 15), where 32 is the batch size
50 seconds per episode
Error occurs after about 583 episodes (8 hours) or 1,150,000 steps, where each step involves a forward pass through the LSTM model.

My code also has the following values set before the training began

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)

How can we troubleshoot this problem? Since it occurred 8 hours into training, any educated guesses would be very helpful!

Thanks!

Update:

Commenting out the 2 torch.backends.cudnn... lines did not work. CUDNN_STATUS_INTERNAL_ERROR still occurs, but much earlier at around Episode 300 (585,000 steps).

torch.manual_seed(0)
#torch.backends.cudnn.deterministic = True
#torch.backends.cudnn.benchmark = False
np.random.seed(0)

System

PyTorch 1.6.0.dev20200525
CUDA 10.2
cuDNN 7604
Python 3.8
Windows 10
NVIDIA 1080 GPU

Error Traceback

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-f5bbb4fdfda5> in <module>
     57 
     58     while not done:
---> 59         action = agent.choose_action(state)
     60         state_, reward, done, info = env.step(action)
     61         score += reward

<ipython-input-11-5ad4dd57b5ad> in choose_action(self, state)
     58         if np.random.random() > self.epsilon:
     59             state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60             actions = self.q_eval.forward(state)
     61             action = T.argmax(actions).item()
     62         else:

<ipython-input-10-94271a92f66e> in forward(self, state)
     20 
     21     def forward(self, state):
---> 22         lstm, hidden = self.lstm(state)
     23         actions = self.fc1(lstm[:,-1:].squeeze(1))
     24         return actions

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    575             result = self._slow_forward(*input, **kwargs)
    576         else:
--> 577             result = self.forward(*input, **kwargs)
    578         for hook in self._forward_hooks.values():
    579             hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\rnn.py in forward(self, input, hx)
    571         self.check_forward_args(input, hx, batch_sizes)
    572         if batch_sizes is None:
--> 573             result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
    574                               self.dropout, self.training, self.bidirectional, self.batch_first)
    575         else:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Update: I wrapped the failing call in try/except. In addition to RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR, printing the state inside the except block produces a second traceback with the error RuntimeError: CUDA error: unspecified launch failure

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-e8f15cc8cf4f> in <module>
     61 
     62     while not done:
---> 63         action = agent.choose_action(state)
     64         state_, reward, done, info = env.step(action)
     65         score += reward

<ipython-input-3-1aae79080e99> in choose_action(self, state)
     58         if np.random.random() > self.epsilon:
     59             state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60             actions = self.q_eval.forward(state)
     61             action = T.argmax(actions).item()
     62         else:

<ipython-input-2-6d22bb632c4c> in forward(self, state)
     25         except Exception as e:
     26             print('error in forward() with state:', state.shape, 'exception:', e)
---> 27             print('state:', state)
     28         actions = self.fc1(lstm[:,-1:].squeeze(1))
     29         return actions

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\tensor.py in __repr__(self)
    152     def __repr__(self):
    153         # All strings are unicode in Python 3.
--> 154         return torch._tensor_str._str(self)
    155 
    156     def backward(self, gradient=None, retain_graph=None, create_graph=False):

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _str(self)
    331                 tensor_str = _tensor_str(self.to_dense(), indent)
    332             else:
--> 333                 tensor_str = _tensor_str(self, indent)
    334 
    335     if self.layout != torch.strided:

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _tensor_str(self, indent)
    227     if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
    228         self = self.float()
--> 229     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    230     return _tensor_str_with_formatter(self, indent, formatter, summarize)
    231 

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
     99 
    100         else:
--> 101             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
    102 
    103             if nonzero_finite_vals.numel() == 0:

RuntimeError: CUDA error: unspecified launch failure

Answer 1

Mic*_*ngo 9

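The error RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is notoriously difficult to debug, but surprisingly often it is an out-of-memory problem. Usually you would get an out-of-memory error, but depending on where it occurs, PyTorch cannot intercept the error and therefore cannot provide a meaningful error message.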

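A memory issue seems likely in your case, because you are using a while loop that runs until the agent is done, which may take long enough that you run out of memory; it is just a matter of time. It can also occur rather late, once the model's parameters, combined with a certain input, are unable to finish in a timely manner.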
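One way to check whether memory really grows over time is to log GPU memory periodically during training. The following is a minimal sketch, assuming a CUDA build of PyTorch; log_gpu_memory and its every parameter are hypothetical names, not part of the question's code.

```python
import torch


def log_gpu_memory(step, every=10_000):
    """Hypothetical helper: report CUDA memory use every `every` steps."""
    # Do nothing on CPU-only machines or between reporting steps.
    if not torch.cuda.is_available() or step % every != 0:
        return None
    allocated_mib = torch.cuda.memory_allocated() / 2**20  # memory held by tensors
    reserved_mib = torch.cuda.memory_reserved() / 2**20    # memory held by the caching allocator
    print(f"step {step}: allocated={allocated_mib:.1f} MiB, reserved={reserved_mib:.1f} MiB")
    return allocated_mib, reserved_mib
```

Calling log_gpu_memory(step) once per step inside the training loop should show a steadily rising "allocated" figure if something is being retained across iterations.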

You can avoid that scenario by limiting the number of allowed actions instead of hoping that the actor will be done in a reasonable time.
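A cap on the number of steps per episode could look roughly like this. It is a sketch only: ToyEnv is a hypothetical stand-in for your environment, and MAX_STEPS_PER_EPISODE is an assumed value that you would need to tune.

```python
import random

MAX_STEPS_PER_EPISODE = 1_000  # assumed cap; tune for your environment


class ToyEnv:
    """Hypothetical stand-in for a Gym-style environment that rarely terminates."""

    def reset(self):
        return 0.0

    def step(self, action):
        # Returns (next_state, reward, done, info); `done` is almost never True.
        return random.random(), 1.0, random.random() < 1e-6, {}


def run_episode(env, choose_action, max_steps=MAX_STEPS_PER_EPISODE):
    state = env.reset()
    score, done, steps = 0.0, False, 0
    # Stop when the environment says so OR when the step budget is used up.
    while not done and steps < max_steps:
        action = choose_action(state)
        state, reward, done, info = env.step(action)
        score += reward
        steps += 1
    return score, steps


score, steps = run_episode(ToyEnv(), choose_action=lambda s: 0)
print("steps taken:", steps, "score:", score)
```

The only change from the loop in the question is the extra steps < max_steps condition, so an episode can no longer run indefinitely.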

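You also need to be careful not to occupy memory unnecessarily. A common mistake is to keep computing gradients of past states in future iterations. The state from the last iteration should be considered constant, since the current action should not affect past actions, so no gradients are required. This is usually achieved by detaching the state from the computational graph for the next iteration, e.g. state = state_.detach(). Maybe you are already doing that, but without the code it is impossible to tell.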

Similarly, if you keep a history of the states, you should detach them and, even more importantly, put them on the CPU, i.e. history.append(state.detach().cpu()).
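Put together, the two calls above behave as in the following small sketch. The weights and state_ tensors are made up purely to produce a state that carries a computation graph; they are not taken from the question's code.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for a state produced by a differentiable computation.
weights = torch.ones(15, device=device, requires_grad=True)
state_ = torch.randn(20, 15, device=device) * weights  # carries a grad_fn

history = []

# Treat the previous state as a constant for the next iteration.
state = state_.detach()

# Keep the history off the GPU and free of autograd graphs.
history.append(state.detach().cpu())

print(state_.requires_grad, state.requires_grad, history[0].device)
```

If the states come straight from the environment as NumPy arrays, they carry no graph in the first place, and this only matters for tensors that passed through the model.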

Answer 2

Rij*_*pta 9

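Anyone facing this error, or other cuDNN/GPU-related errors, should try moving the model and the inputs to the CPU. The CPU runtime generally has much better error reporting and will let you debug the issue.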

In my experience, most of the time the error comes from an invalid index passed to an embedding.
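As a rough illustration of both points, the sketch below runs a toy embedding on the CPU with a deliberately out-of-range index. The sizes and the check_indices helper are hypothetical; the exact exception type raised on the CPU may differ between PyTorch versions, so both IndexError and RuntimeError are caught.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes, assumed
indices = torch.tensor([1, 5, 10])  # 10 is out of range (valid rows are 0..9)


def check_indices(indices, num_embeddings):
    """Hypothetical helper: True if every index is a valid row of the table."""
    return bool(((indices >= 0) & (indices < num_embeddings)).all())


# Run on the CPU, where an out-of-range lookup raises a Python exception
# at the offending call instead of a later, opaque CUDA failure.
model = embedding.cpu()
try:
    model(indices.cpu())
    failed = False
except (IndexError, RuntimeError) as exc:
    failed = True
    print("CPU run failed with:", type(exc).__name__, exc)

print("indices valid:", check_indices(indices, embedding.num_embeddings))
```

A check like check_indices can also be left in the GPU code path as an assertion while debugging, at the cost of an extra synchronization per call.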
