Ath*_*dom 8 python reinforcement-learning lstm pytorch dqn
After training a PyTorch model on a GPU for several hours, the program fails with the error
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Training Conditions
nn.LSTM with nn.Linear output
state passed into forward() has the shape (32, 20, 15), where 32 is the batch size
My code also set the following values before training began:
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)
How can we troubleshoot this problem? Since this occurred 8 hours into the training, any educated guesses would be very helpful here!
Thanks!
Update:
Commenting out the 2 torch.backends.cudnn... lines did not work. CUDNN_STATUS_INTERNAL_ERROR still occurs, but much earlier at around Episode 300 (585,000 steps).
torch.manual_seed(0)
#torch.backends.cudnn.deterministic = True
#torch.backends.cudnn.benchmark = False
np.random.seed(0)
System
Error Traceback
RuntimeError Traceback (most recent call last)
<ipython-input-18-f5bbb4fdfda5> in <module>
57
58 while not done:
---> 59 action = agent.choose_action(state)
60 state_, reward, done, info = env.step(action)
61 score += reward
<ipython-input-11-5ad4dd57b5ad> in choose_action(self, state)
58 if np.random.random() > self.epsilon:
59 state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60 actions = self.q_eval.forward(state)
61 action = T.argmax(actions).item()
62 else:
<ipython-input-10-94271a92f66e> in forward(self, state)
20
21 def forward(self, state):
---> 22 lstm, hidden = self.lstm(state)
23 actions = self.fc1(lstm[:,-1:].squeeze(1))
24 return actions
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
575 result = self._slow_forward(*input, **kwargs)
576 else:
--> 577 result = self.forward(*input, **kwargs)
578 for hook in self._forward_hooks.values():
579 hook_result = hook(self, input, result)
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\rnn.py in forward(self, input, hx)
571 self.check_forward_args(input, hx, batch_sizes)
572 if batch_sizes is None:
--> 573 result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
574 self.dropout, self.training, self.bidirectional, self.batch_first)
575 else:
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Update: I wrapped the call where this error occurs in try... except. In addition to RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR, a second traceback now appears for the error RuntimeError: CUDA error: unspecified launch failure
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-4-e8f15cc8cf4f> in <module>
61
62 while not done:
---> 63 action = agent.choose_action(state)
64 state_, reward, done, info = env.step(action)
65 score += reward
<ipython-input-3-1aae79080e99> in choose_action(self, state)
58 if np.random.random() > self.epsilon:
59 state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60 actions = self.q_eval.forward(state)
61 action = T.argmax(actions).item()
62 else:
<ipython-input-2-6d22bb632c4c> in forward(self, state)
25 except Exception as e:
26 print('error in forward() with state:', state.shape, 'exception:', e)
---> 27 print('state:', state)
28 actions = self.fc1(lstm[:,-1:].squeeze(1))
29 return actions
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\tensor.py in __repr__(self)
152 def __repr__(self):
153 # All strings are unicode in Python 3.
--> 154 return torch._tensor_str._str(self)
155
156 def backward(self, gradient=None, retain_graph=None, create_graph=False):
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _str(self)
331 tensor_str = _tensor_str(self.to_dense(), indent)
332 else:
--> 333 tensor_str = _tensor_str(self, indent)
334
335 if self.layout != torch.strided:
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _tensor_str(self, indent)
227 if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
228 self = self.float()
--> 229 formatter = _Formatter(get_summarized_data(self) if summarize else self)
230 return _tensor_str_with_formatter(self, indent, formatter, summarize)
231
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
99
100 else:
--> 101 nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
102
103 if nonzero_finite_vals.numel() == 0:
RuntimeError: CUDA error: unspecified launch failure
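The error RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is notoriously difficult to debug, but surprisingly often it is an out-of-memory problem. Usually you would get an out-of-memory error, but depending on where it occurs, PyTorch cannot intercept the error and therefore cannot provide a meaningful error message.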
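A memory issue seems likely in your case, because you run a while loop until the agent is done, which may take long enough to exhaust memory; it is only a matter of time. It can also occur rather late, once the model's parameters in combination with a certain input make the episode unable to finish in a timely manner.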
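One way to check the memory hypothesis is to log GPU memory once per episode and watch for steady growth. The following is a minimal sketch, not part of the original code: gpu_memory_mib is a hypothetical helper name, and the numbers cover only memory held by PyTorch tensors, so they may not show everything in use on the device.

```python
import torch


def gpu_memory_mib(device=None):
    """Return (allocated, peak) GPU memory in MiB, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    return allocated, peak


# Hypothetical usage: call once per episode inside the training loop.
for episode in range(3):
    print("episode", episode, "GPU memory (MiB):", gpu_memory_mib())
```

If the allocated figure keeps climbing from episode to episode, something is probably holding on to tensors longer than intended.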
You can avoid that scenario by limiting the number of allowed actions instead of hoping that the actor will be done in a reasonable time.
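A minimal sketch of such a cap, using a toy stand-in environment. ToyEnv, run_episode, and max_steps are hypothetical names, not taken from the question's code. The loop ends when the environment reports done or when the step budget is used up, whichever comes first:

```python
import random


class ToyEnv:
    """Hypothetical stand-in for the real environment; it never reports done."""

    def reset(self):
        return 0.0

    def step(self, action):
        # Returns (next_state, reward, done, info), mirroring the loop in the question.
        return random.random(), 1.0, False, {}


def run_episode(env, choose_action, max_steps=1000):
    """Run one episode, stopping after max_steps even if the env never finishes."""
    state = env.reset()
    score, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = choose_action(state)
        state, reward, done, info = env.step(action)
        score += reward
        steps += 1
    return score, steps


score, steps = run_episode(ToyEnv(), choose_action=lambda s: 0, max_steps=50)
print(steps)  # 50: the cap ended the episode
```

A suitable value for max_steps depends on the task; it should be large enough that normal episodes still finish on their own.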
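You also need to be careful not to occupy unnecessary memory. A common mistake is to keep computing gradients of past states in future iterations. The state from the last iteration should be considered constant, since the current action should not affect past actions, so no gradients are required for it. This is usually achieved by detaching the state from the computational graph for the next iteration, e.g. state = state_.detach(). Maybe you are already doing that, but without the code it is impossible to tell.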
Similarly, if you keep a history of the states, you should detach them and, even more importantly, put them on the CPU, i.e. history.append(state.detach().cpu()).
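The two points about detaching can be sketched together as follows. This is an illustration under assumptions: the tiny LSTM, the toy transition, and the history list are hypothetical and only stand in for the real agent and environment.

```python
import torch
import torch.nn as nn

# Hypothetical tiny recurrent model, only for illustration.
lstm = nn.LSTM(input_size=15, hidden_size=8, batch_first=True)

state = torch.randn(1, 20, 15)
history = []

for step in range(3):
    output, hidden = lstm(state)  # output is attached to the autograd graph

    # Toy transition whose result still depends on the graph above.
    next_state = torch.randn(1, 20, 15) + output.mean()

    # Cut the link to this iteration's graph before reusing the state.
    state = next_state.detach()

    # Stored copies are detached and moved to the CPU so they hold no GPU memory.
    history.append(state.detach().cpu())

print(next_state.requires_grad, state.requires_grad)  # True False
```

Without the detach() call, each new state would keep a reference to the graph of every earlier iteration, so memory use would grow with the length of the episode.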
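Anyone facing this error or other cuDNN/GPU-related errors should try moving the model and its inputs to the CPU. The CPU runtime generally has much better error reporting and will let you debug the issue.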
In my experience, most of the time the error comes from an invalid index passed to an embedding.
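As a small illustration of the CPU approach, the sketch below feeds an out-of-range index to an embedding layer on the CPU. The layer size and the indices are made up for the example; the model in the question may not use an embedding at all.

```python
import torch
import torch.nn as nn

# Hypothetical embedding layer with 10 valid indices (0..9).
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Run the model and its inputs on the CPU while debugging.
device = torch.device("cpu")
embedding = embedding.to(device)
bad_indices = torch.tensor([3, 10], device=device)  # 10 is out of range

try:
    embedding(bad_indices)
    error = None
except IndexError as exc:
    # On the CPU the failure is raised at the offending call with a readable message.
    error = exc
    print("caught:", type(exc).__name__, exc)
```

On the GPU the same mistake tends to surface as an opaque CUDA error, possibly at a later, unrelated call, which is why reproducing the problem on the CPU first can save time.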