Error: Some NCCL operations have failed or timed out

Shi*_*hah 7 python distributed gpu pytorch nvidia-docker

When running distributed training on 4 A6000 GPUs, I get the following error:

[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I am using the standard NVIDIA PyTorch Docker image. Interestingly, training works fine on small datasets, but on larger datasets I get this error, so I can confirm that the training code itself is correct and works.

There is no actual runtime error or any other output anywhere that points to the real cause.

Shi*_*hah 8

The following two changes solved the issue:

  • Increase the default SHM (shared memory) for the container to 10g (I think 1g would have worked as well). You can do this in the docker run command by passing --shm-size=10g. I also pass --ulimit memlock=-1.
  • export NCCL_P2P_LEVEL=NVL. The same variable can also be set from inside the training script before the process group is created, as sketched below.
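
As a rough illustration of the second point (my own sketch, not from the original answer; it assumes a standard torch.distributed setup launched with something like torchrun), the NCCL environment variables have to be in place before the NCCL communicator is created, i.e. before init_process_group:

import os
import torch
import torch.distributed as dist

# Set NCCL-related variables before the process group (and hence the NCCL
# communicator) is created; the values mirror the fix above.
os.environ.setdefault("NCCL_P2P_LEVEL", "NVL")  # use P2P only between NVLink-connected GPUs
os.environ.setdefault("NCCL_DEBUG", "INFO")     # optional: verbose NCCL logging

def setup(rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR / MASTER_PORT (and RANK / WORLD_SIZE) come from the launcher.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)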

Debugging Tips

To check the current SHM size:

df -h
# see the row for shm

To see NCCL debug messages:

export NCCL_DEBUG=INFO

Run the P2P bandwidth test to check the GPU-to-GPU communication links:

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
sudo make
./p2pBandwidthLatencyTest

On a 4× A6000 box this prints:

[Screenshot: p2pBandwidthLatencyTest output showing the GPU-to-GPU bandwidth matrix]

The matrix shows the bandwidth between each pair of GPUs; with P2P enabled, it should be high.
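
As an extra check from the PyTorch side (not part of the original answer; assumes a reasonably recent PyTorch build that exposes torch.cuda.can_device_access_peer), you can print the peer-access matrix that CUDA reports to PyTorch:

import torch

# Print which GPU pairs report peer (P2P) access; 1 means peer access is possible.
n = torch.cuda.device_count()
for i in range(n):
    row = [1 if i == j else int(torch.cuda.can_device_access_peer(i, j)) for j in range(n)]
    print(f"GPU {i}: {row}")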


Ber*_*998 7

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

Set the timeout argument in torch.distributed.init_process_group(); the default is 30 minutes.

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)
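
For example, a minimal sketch that raises the timeout to two hours (the exact value is illustrative, not from the original answer; it assumes the usual env:// rendezvous variables are set by the launcher):

import datetime
import torch.distributed as dist

# Raise the collective timeout from the default 30 minutes to 2 hours so the
# watchdog does not kill long-running collectives on large datasets.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)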