RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2

vpa*_*pap 13 parallel-processing pytorch

I have 4 GPUs (0, 1, 2, 3) and I want to run one Jupyter notebook on GPU 2 and another one on GPU 0. So, after executing,

 export CUDA_VISIBLE_DEVICES=0,1,2,3

in the notebook meant for GPU 2, I do,

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
device, torch.cuda.device_count(), torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_properties(1)

and after creating a new model or loading one,

model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)

Then, when I start training the model, I get:

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-849ffcb53e16> in <module>
 46             with torch.set_grad_enabled( phase == 'train'):
 47                 # [N, Nclass, H, W]
 ---> 48                 prediction = model(X)
 49                 # print( prediction.shape, y.shape)
 50                 loss_matrix = criterion( prediction, y)

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491             result = self._slow_forward(*input, **kwargs)
492         else:
--> 493             result = self.forward(*input, **kwargs)
494         for hook in self._forward_hooks.values():
495             hook_result = hook(self, input, result)

~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144                 raise RuntimeError("module must have its parameters and buffers "
145                                    "on device {} (device_ids[0]) but found one of "
--> 146                                    "them on device: {}".format(self.src_device_obj, t.device))
147 
148         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

jod*_*dag 17

DataParallel requires every input tensor to be provided on the first device in its device_ids list.

It basically uses that device as a staging area before scattering to the other GPUs, and it is the device where the final outputs are gathered before being returned from forward. If you want device 2 to be the primary device, then you just need to put it at the front of the list, like this:

model = nn.DataParallel(model, device_ids = [2, 0, 1, 3])
model.to(f'cuda:{model.device_ids[0]}')

After that, all tensors you feed to the model should also be on that first device:

x = ... # input tensor
x = x.to(f'cuda:{model.device_ids[0]}')
y = model(x)
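The device-ordering idea above can be sketched with a small helper, `primary_first` (a hypothetical function, not part of PyTorch), that rebuilds the `device_ids` list with the chosen primary GPU in front; the commented lines show how it would plug into `nn.DataParallel` on a machine with a CUDA build of PyTorch:

```python
def primary_first(primary, device_ids):
    """Return device_ids reordered so `primary` comes first.

    DataParallel treats device_ids[0] as the staging device for
    scattering inputs and gathering outputs, so putting GPU 2
    first makes it the primary device.
    """
    return [primary] + [d for d in device_ids if d != primary]

# Hypothetical usage (requires torch with CUDA and 4 GPUs):
# ids = primary_first(2, [0, 1, 2, 3])
# model = nn.DataParallel(model, device_ids=ids)
# model.to(f'cuda:{ids[0]}')

print(primary_first(2, [0, 1, 2, 3]))  # prints [2, 0, 1, 3]
```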

  • Doesn't doing `model.to(f'cuda:{model.device_ids[0]}')` use only GPU 2 and defeat the purpose of model parallelism? (3 upvotes)