Whenever I parallelize across multiple GPUs with torch.multiprocessing.spawn, including the code examples from the Parallel and Distributed Training tutorials, I get an error:
    Exception: process 0 terminated with exit code 1
Does anyone know what "terminated with exit code 1" means (i.e., why the process terminated)?
The example from the PyTorch DDP notes:
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP


    def example(rank, world_size):
        # create default process group
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        # create local model
        model = nn.Linear(10, 10).to(rank)
        # construct DDP model
        ddp_model = DDP(model, device_ids=[rank])
        # define loss function and optimizer
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        # forward pass
        …
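For what it's worth, my understanding is that the "exit code" here is just the ordinary process exit status: spawn joins each child process and raises the exception above when one of them exits non-zero (which happens when the worker raises an uncaught exception or calls sys.exit(1)). A minimal sketch of that situation using the plain standard-library multiprocessing module instead of torch (the names worker and launch are mine, and I use the default start method for brevity):

```python
import multiprocessing
import sys


def worker(rank):
    # Simulate a failing worker: an explicit sys.exit(1) (or any
    # uncaught exception) makes the child terminate with exit code 1.
    sys.exit(1)


def launch(nprocs=2):
    """Start nprocs workers, wait for them, and return their exit codes.

    This mirrors what torch.multiprocessing.spawn checks before raising
    "process 0 terminated with exit code 1".
    """
    procs = [
        multiprocessing.Process(target=worker, args=(rank,))
        for rank in range(nprocs)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

So the message tells me a child died with status 1, but not *why* it died, which is what I'm trying to find out.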