Using PyTorch's multiprocessing and distributed modules together

Question

I am trying to spawn several processes with PyTorch's multiprocessing module while using the OpenMPI distributed backend. This is the code I have:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank_local, rank, world_size, maingp):
    print("I WAS SPAWNED ", rank_local, " OF ", rank)

    tensor = torch.zeros(1)
    tensor += 1

    if rank == 0:
        tensor += 100
        dist.send(tensor, dst=1)
    else:
        print("I am spawn: ", rank, "and my tensor value before receive: ", tensor[0])
        dist.recv(tensor, src=0)
        print("I am spawn: ", rank, "and my tensor value after  receive: ", tensor[0])


if __name__ == '__main__':

    # Initialize Process Group
    dist.init_process_group(backend="mpi", group_name="main")
    maingp = None #torch.distributed.new_group([0,1])
    mp.set_start_method('spawn')    

    # get current process information
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Establish Local Rank and set device on this node
    mp.spawn(run, args=(rank, world_size, maingp), nprocs=1)

I run this code with OpenMPI as follows:

mpirun -n 2 python code.py

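My understanding is that mpirun creates two processes with ranks [0, 1], and each of them spawns one new process with local rank 0. When I try to communicate between these two child processes, I get a traceback ending in the following error: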

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/usama/code/test/code.py", line 19, in run
    dist.send(tensor, dst=1)
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 666, in send
    _check_default_pg()
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

My question is how to make these child processes communicate, i.e. how the [0, 0] process can send something to the [1, 0] process. Any ideas?

Answer 1

mir*_*phd 1

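Sometimes premature optimization makes a problem more constrained than it needs to be, as with the choice of the MPI backend here. With MPI this may not actually be possible: Ray, a popular distributed training framework, supports the other two backends, NCCL and Gloo, but not MPI. See its code: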

RuntimeError for Backend.MPI
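Before settling on a backend, it can help to check which ones the installed PyTorch build was compiled with; the MPI backend is only present when PyTorch is built from source against an MPI installation. The following is a minimal sketch, not part of the original answer: `pick_backend`, `available_backends`, and the preference order are hypothetical names and choices made for illustration, and the order assumes CPU tensors as in the question.

```python
def pick_backend(available, preference=("gloo", "mpi", "nccl")):
    """Return the first backend in `preference` that is marked available.

    `available` maps backend names to booleans. The default order is an
    assumption that suits CPU tensors; it is not a PyTorch recommendation.
    """
    for name in preference:
        if available.get(name, False):
            return name
    raise RuntimeError("no torch.distributed backend is available")


def available_backends():
    """Ask the installed PyTorch build which backends it was compiled with."""
    # Imported lazily so pick_backend stays usable without torch installed.
    import torch.distributed as dist

    if not dist.is_available():
        return {}
    return {
        "gloo": dist.is_gloo_available(),
        "mpi": dist.is_mpi_available(),
        "nccl": dist.is_nccl_available(),
    }


if __name__ == "__main__":
    print(pick_backend(available_backends()))
```

If `mpi` is reported as unavailable, `init_process_group(backend="mpi")` cannot work with that build regardless of how the script is launched.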

An example of distributed training of a PyTorch model with Ray, using a backend other than MPI (source):

import pytorch_lightning as pl
from ray_lightning import RayPlugin

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., gpus=1, plugins=[plugin])
trainer.fit(ptl_model)
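As for the script in the question, the assertion is raised because `dist.init_process_group` runs only in the processes launched by mpirun. A process started by `mp.spawn` is a fresh interpreter and has no process group of its own. One way around this, sketched below under several assumptions, is to let every spawned child join a Gloo group itself. The assumptions: all processes run on one machine, port 29500 is free, and the parents read their rank from the `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` environment variables that Open MPI sets. The helper `global_rank` and the argument names are hypothetical.

```python
import os


def global_rank(parent_rank, local_rank, procs_per_parent):
    """Map (parent rank, local rank) to one flat rank across all children."""
    return parent_rank * procs_per_parent + local_rank


def run(local_rank, parent_rank, procs_per_parent, world_size, init_method):
    # Imported inside the worker so global_rank stays usable without torch.
    import torch
    import torch.distributed as dist

    rank = global_rank(parent_rank, local_rank, procs_per_parent)
    # The child must join a process group itself; nothing is inherited
    # from the parent that called mp.spawn.
    dist.init_process_group(backend="gloo", init_method=init_method,
                            rank=rank, world_size=world_size)
    tensor = torch.zeros(1) + 1
    if world_size > 1 and rank == 0:
        tensor += 100
        dist.send(tensor, dst=1)
    elif rank == 1:
        dist.recv(tensor, src=0)
        print("rank", rank, "received", tensor.item())
    dist.destroy_process_group()


if __name__ == "__main__":
    import torch.multiprocessing as mp

    # Fall back to a single parent when not launched through mpirun.
    parent_rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    num_parents = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    procs_per_parent = 1
    world_size = num_parents * procs_per_parent
    init_method = "tcp://127.0.0.1:29500"  # assumed free port on one host
    mp.spawn(run,
             args=(parent_rank, procs_per_parent, world_size, init_method),
             nprocs=procs_per_parent)
```

Launched with `mpirun -n 2 python code.py` as in the question, mpirun here only starts the two parents; the tensor exchange between the children goes over Gloo, not MPI. On several machines the address in `init_method` would need to point at the host of rank 0.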
