I've seen several questions about the following error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
but none of them seem to solve it for me:
I tried running torch.cuda.set_device(device) manually at the beginning of every script. That doesn't seem to work for me. I've tried different GPUs. I've tried downgrading the PyTorch and CUDA versions: different combinations of 1.6.0, 1.7.1, and 1.8.0 with CUDA 10.2, 11.0, and 11.1. I'm not sure what else to do. What have people done to solve this?
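In case it helps reproduce my setup: before spawning any workers I also experimented with NCCL's debugging and networking environment variables. The variable names below are the documented NCCL ones; the values are just guesses for my single-node machine, not a confirmed fix:

```python
import os

# Attempted workarounds, set before any CUDA/NCCL work happens
# (values are guesses for a typical single-node machine):
os.environ['NCCL_DEBUG'] = 'INFO'          # ask NCCL to log why the CUDA call failed
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # pin NCCL to a specific network interface
os.environ['NCCL_IB_DISABLE'] = '1'        # rule out InfiniBand transport problems

print(os.environ['NCCL_DEBUG'], os.environ['NCCL_IB_DISABLE'])
```

With NCCL_DEBUG=INFO, the worker processes at least print which CUDA call fails instead of the opaque ncclUnhandledCudaError.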
Perhaps very relevant?
A more complete error message:
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store …

I was trying to set up DDP (distributed data parallel) on a DGX A100, but it doesn't work. Whenever I try to run it, it just hangs. My code is very simple: it only spawns 4 processes for 4 GPUs (for debugging, I simply destroy the group immediately, but it doesn't even get that far):
def find_free_port():
    """ /sf/ask/95568581/ """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])
def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for processes to communicate/coordinate with each other/the master. It's a backend library.

    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_DISABLE=1

    /sf/ask/4275277331/
    https://pytorch.org/docs/stable/distributed.html#common-environment-variables
    """
    if rank …
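For completeness, here is a minimal, CPU-only sketch of the setup/teardown cycle I was aiming for, using the gloo backend so it runs without touching NCCL at all. The configure_master helper and its default address are my own additions for illustration, not part of any PyTorch API:

```python
import os
import socket
from contextlib import closing

def find_free_port() -> str:
    """Ask the OS for an unused TCP port and return it as a string."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def configure_master(addr: str = '127.0.0.1', port: str = None) -> str:
    """Point every rank at the same rendezvous endpoint (hypothetical helper)."""
    port = port or find_free_port()
    os.environ['MASTER_ADDR'] = addr
    os.environ['MASTER_PORT'] = port
    return port

def setup_process(rank: int, world_size: int, backend: str = 'gloo'):
    """Join the process group; gloo works on CPU, so NCCL bugs are ruled out."""
    import torch.distributed as dist  # deferred import: the helpers above need no torch
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    dist.destroy_process_group()      # for debugging, tear down immediately

port = configure_master()
print(port)
```

In real use, the parent process would call configure_master once before mp.spawn, and each spawned worker would call setup_process with its rank. If gloo also hangs here, the problem is in the rendezvous (addresses, ports, firewall); if only nccl hangs, it points back at the CUDA/NCCL layer.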