Cha*_*ker 8 python gpu machine-learning multi-gpu pytorch
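I was trying to set up DDP (distributed data parallel) on a DGX A100, but it doesn't work. Whenever I try to run it, it hangs. My code is very simple: it just spawns 4 processes for 4 GPUs (for debugging I destroy the group immediately, but it doesn't even get that far):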
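import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    # Binding to port 0 lets the OS pick a currently unused port.
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])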
def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for process to communicate/coordinate with each other/master. It's a backend library.

    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_DISABLE=1

    /sf/ask/4275277331/
    https://pytorch.org/docs/stable/distributed.html#common-environment-variables
    """
    if rank != -1:  # -1 rank indicates serial code
        print(f'setting up rank={rank} (with world_size={world_size})')
        # MASTER_ADDR = 'localhost'
        MASTER_ADDR = '127.0.0.1'
        MASTER_PORT = find_free_port()
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = MASTER_ADDR
        print(f"{MASTER_ADDR=}")
        os.environ['MASTER_PORT'] = MASTER_PORT
        print(f"{MASTER_PORT}")

        # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
        if torch.cuda.is_available():
            # unsure if this is really needed
            # os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
            # os.environ['NCCL_IB_DISABLE'] = '1'
            backend = 'nccl'
        print(f'{backend=}')
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
        print(f'--> done setting up rank={rank}')
        dist.destroy_process_group()

mp.spawn(setup_process, args=(4,), world_size=4)
Why is this hanging?
nvidia-smi output:
$ nvidia-smi
Fri Mar  5 12:47:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   25C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   48C    P0   231W / 400W |   7500MiB / 40537MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    7   N/A  N/A    147243      C   python                           7497MiB |
+-----------------------------------------------------------------------------+
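How do I set up DDP on this new machine?
By the way, I have successfully installed APEX because some other links said to do that, but it still fails. To install it:
I went to https://github.com/NVIDIA/apex and followed their instructions: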
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
but before the above I had to update gcc:
conda install -c psi4 gcc-5
It did install, since I could import it successfully, but it didn't help.
Now it actually prints an error message:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 252, in train
    setup_process(rank, world_size=opts.world_size)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/distributed.py", line 85, in setup_process
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
During handling of the above exception, another exception occurred:
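The following fixes are based on "Writing Distributed Applications with PyTorch", section "Initialization Methods".
Issue 1:
It will hang unless you pass nprocs=world_size to mp.spawn(). In other words, it is waiting for the "whole world" to show up, process-wise. mp.spawn() starts only one process by default (nprocs=1), so with world_size=4 the other three ranks never arrive.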
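The waiting behavior can be imitated with the standard library alone. The sketch below is only an analogy, not PyTorch's actual rendezvous: a threading.Barrier sized for world_size stands in for the process group, and the helper name rendezvous and its timeout are made up for illustration.

```python
import threading

def rendezvous(world_size, nprocs, timeout=0.5):
    """Start nprocs workers that all wait on a barrier sized for world_size.

    Returns True if every worker got past the barrier, False if a wait
    timed out because fewer than world_size workers showed up.
    """
    barrier = threading.Barrier(world_size)
    results = []

    def worker():
        try:
            # Blocks until world_size workers have arrived, or the timeout hits.
            barrier.wait(timeout=timeout)
            results.append(True)
        except threading.BrokenBarrierError:
            results.append(False)

    threads = [threading.Thread(target=worker) for _ in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)

if __name__ == '__main__':
    print(rendezvous(world_size=4, nprocs=4))  # all four "ranks" arrive
    print(rendezvous(world_size=4, nprocs=1))  # only one "rank" arrives
```

With four workers the barrier releases. With one worker it waits until the timeout, which is the situation a lone rank 0 is in when mp.spawn() is left at its default of one process.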
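Issue 2:
MASTER_ADDR and MASTER_PORT need to be the same in every process's environment, and they need to be a free address:port combination on the machine where the rank 0 process will run. In the question's code every process calls find_free_port() itself, so each rank can end up with a different MASTER_PORT.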
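To see why letting each rank pick its own port goes wrong, here is a small standard-library sketch. It assumes that several sockets held open at the same time are a fair stand-in for several ranks each asking the OS for a free port. The helper name pick_ports is hypothetical.

```python
import socket
from contextlib import ExitStack

def pick_ports(n):
    """Ask the OS for n free ports while keeping all n sockets open at once."""
    with ExitStack() as stack:
        ports = []
        for _ in range(n):
            s = stack.enter_context(
                socket.socket(socket.AF_INET, socket.SOCK_STREAM))
            s.bind(('127.0.0.1', 0))  # port 0: let the OS choose
            ports.append(s.getsockname()[1])
        return ports

if __name__ == '__main__':
    ports = pick_ports(4)
    # Each "rank" got its own port, so no two would agree on MASTER_PORT.
    print(ports, len(set(ports)) == len(ports))
```

Picking the port once in the parent process and handing the same value to every rank avoids the mismatch.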
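Both of these are implied by, or can be read directly from, the following quote from the link above (emphasis added):
Environment Variable
We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.
MASTER_PORT: A free port on the machine that will host the process with rank 0.
MASTER_ADDR: IP address of the machine that will host the process with rank 0.
WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.
RANK: Rank of each process, so they will know whether it is the master of a worker.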
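A quick way to catch a missing variable before init_process_group() blocks is to check the environment first. The helper below is a hypothetical sketch, not part of PyTorch, and the port value 29500 is only an example. When rank and world_size are passed as arguments to init_process_group(), only MASTER_ADDR and MASTER_PORT have to come from the environment, so the set of required names is a parameter.

```python
import os

ALL_VARS = ('MASTER_ADDR', 'MASTER_PORT', 'WORLD_SIZE', 'RANK')

def missing_dist_env(env=None, required=ALL_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

if __name__ == '__main__':
    example = {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500'}
    print(missing_dist_env(example))
    # Only the address and port are needed when rank and world_size
    # are passed to init_process_group() as arguments.
    print(missing_dist_env(example, required=('MASTER_ADDR', 'MASTER_PORT')))
```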
Here is some code that demonstrates both fixes in action:
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')

    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")

    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")

if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    mp.spawn(setup_process, args=(master_addr, master_port, world_size,), nprocs=world_size)
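If a rank still fails with "connect() timed out", as in the question's traceback where the TCPStore constructor raises, it can help to check whether anything is listening on MASTER_ADDR:MASTER_PORT at all. The helper below is a hypothetical diagnostic built on the standard library only. It is not a PyTorch API, and it can only succeed once rank 0 has actually started listening.

```python
import socket

def can_connect(addr, port, timeout=1.0):
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

if __name__ == '__main__':
    import os
    addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    port = os.environ.get('MASTER_PORT', '29500')  # example default only
    print(f'{addr}:{port} reachable: {can_connect(addr, port)}')
```

A False result from a non-zero rank suggests that the ranks disagree on the address or port, or that rank 0 never started.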