I am using PyTorch distributed training to train my model. I have two nodes with two GPUs each. On the first node I run:
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpu 2 --num-machines 2 --machine-rank 0 --dist-url tcp://192.168.**.***:8000
and on the other node:
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpu 2 --num-machines 2 --machine-rank 1 --dist-url tcp://192.168.**.***:8000
But the second node fails with a RuntimeError:
global_rank 3 machine_rank 1 num_gpus_per_machine 2 local_rank 1
global_rank 2 machine_rank 1 num_gpus_per_machine 2 local_rank 0
Traceback (most recent call last):
File "train_net.py", line 109, in <module>
args=(args,),
File "/root/detectron2_repo/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise …
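For reference, the two "global_rank" lines in the log are consistent with the usual rank layout, where each process's global rank is machine_rank * num_gpus_per_machine + local_rank. The sketch below only reproduces that arithmetic for the 2-node, 2-GPU setup described above. The helper names `global_rank` and `world_size` are made up for illustration and are not part of detectron2 or PyTorch.

```python
# Minimal sketch of the rank arithmetic, assuming the usual layout
# global_rank = machine_rank * num_gpus_per_machine + local_rank.
# Helper names are hypothetical, chosen for this illustration only.

def global_rank(machine_rank: int, num_gpus_per_machine: int, local_rank: int) -> int:
    """Return the rank of one worker process across all machines."""
    return machine_rank * num_gpus_per_machine + local_rank


def world_size(num_machines: int, num_gpus_per_machine: int) -> int:
    """Return the total number of worker processes in the job."""
    return num_machines * num_gpus_per_machine


if __name__ == "__main__":
    num_machines, num_gpus = 2, 2  # values from the commands in the question
    print("world_size", world_size(num_machines, num_gpus))
    for machine_rank in range(num_machines):
        for local_rank in range(num_gpus):
            print(
                "global_rank", global_rank(machine_rank, num_gpus, local_rank),
                "machine_rank", machine_rank,
                "local_rank", local_rank,
            )
```

With `--machine-rank 1` and `--num-gpu 2` this gives global ranks 2 and 3, matching the log, so the rank assignment on the second node looks as expected and the failure is more likely to come from something else in the elided part of the traceback.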