Kjy*_*ong 10 cuda gpu nvidia ray pytorch
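While training this code with Ray Tune (one GPU per trial), after a few hours of training (about 20 trials), GPU:0,1 raised a `CUDA out of memory` error. Even after terminating the training process, those GPUs still give an `out of memory` error.
As shown above, all of my GPU devices are currently empty, and no other Python processes are running apart from these two.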
import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working
torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory
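However, GPU:0,1 still give the `out of memory` error. If I reboot the computer (Ubuntu 18.04.3), everything returns to normal, but the same problem appears once I train the code again.
How can I debug this problem, or resolve it without rebooting?
Output of `dmesg | grep -i -e nvidia -e nvrm` (checking for the `GPU has fallen off the bus` error):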
[ 5.946174] nvidia: loading out-of-tree module taints kernel.
[ 5.946181] nvidia: module license 'NVIDIA' taints kernel.
[ 5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[ 5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[ 6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[ 6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.27.04 Fri Dec 11 23:35:05 UTC 2020
[ 6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.27.04 Fri Dec 11 23:24:19 UTC 2020
[ 6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[ 6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[ 6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[ 6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[ 6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[ 6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[ 6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[ 6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[ 6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[ 6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[ 6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[ 6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[ 6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[ 6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[ 6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[ 6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[ 7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[ 7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[ 7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[ 7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[ 7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[ 7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[ 7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[ 7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[ 9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
return current_context().device
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
return _runtime.get_or_create_context(devnum)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
return self._get_or_create_context_uncached(devnum)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
return self._activate_context_for(0)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
newctx = gpu.get_primary_context()
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
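After rebooting and upgrading PyTorch to 1.9.1+cu111, the problem no longer seemed to occur.
It has happened again, but I cannot run the commands suggested in the comments because I do not have root access.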
小智 1
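I believe this may be caused by memory fragmentation, which can occur in some cases when memory is allocated and freed in CUDA.
Try calling `torch.cuda.empty_cache()` after training the model, or set `PYTORCH_NO_CUDA_MEMORY_CACHING=1` in your environment to disable caching. This may help reduce GPU memory fragmentation in some cases.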
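One way to apply the environment variable without editing the training code is to launch training as a child process with a modified environment. The following is only a sketch under assumptions: `env_without_cuda_caching` and `run_with_caching_disabled` are hypothetical helper names, and the command being run is a stand-in for the real training command. The variable has to be in place before the training process starts.

```python
import os
import subprocess
import sys


def env_without_cuda_caching(base_env=None):
    """Return a copy of the environment with PyTorch's CUDA caching disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    return env


def run_with_caching_disabled(command):
    """Run `command` in a child process that sees the modified environment."""
    completed = subprocess.run(command, env=env_without_cuda_caching(), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for the real training command, for example
    # [sys.executable, "train.py"] (train.py is a hypothetical script name).
    # This child process only echoes the variable it received.
    echo_variable = [
        sys.executable,
        "-c",
        "import os; print(os.environ['PYTORCH_NO_CUDA_MEMORY_CACHING'])",
    ]
    run_with_caching_disabled(echo_variable)
```

Disabling the caching allocator is mainly a debugging aid and may slow training, so it is probably best used to confirm whether caching plays a role rather than as a permanent setting.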
To debug memory usage, you can follow this documentation: https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
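As a starting point for that kind of debugging, the sketch below reports what PyTorch's allocator holds on each visible GPU and then releases cached blocks. `cuda_memory_report` and `release_cached_memory` are hypothetical helper names. These counters only cover memory managed by PyTorch in the current process, so they can read 0 (as `torch.cuda.memory_reserved()` does in the question) even while the driver reports the device as out of memory.

```python
import gc

import torch


def cuda_memory_report():
    """Return {device_index: (allocated_bytes, reserved_bytes)} per visible GPU.

    If querying a device raises RuntimeError, that device is recorded as None
    so the remaining devices are still reported.
    """
    report = {}
    if not torch.cuda.is_available():
        return report
    for index in range(torch.cuda.device_count()):
        try:
            report[index] = (
                torch.cuda.memory_allocated(index),
                torch.cuda.memory_reserved(index),
            )
        except RuntimeError:
            report[index] = None
    return report


def release_cached_memory():
    """Collect unreachable Python objects, then free unused cached CUDA blocks."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


if __name__ == "__main__":
    print("before:", cuda_memory_report())
    release_cached_memory()
    print("after: ", cuda_memory_report())
```

Calling `release_cached_memory()` at the end of each trial returns cached blocks that this process no longer uses. It cannot free memory held by a different process.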