ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

Sur*_*ale 8 python pytorch

I am unable to initialize the process group in PyTorch for a BERT model. I tried to initialize it with the following code:

import torch
import datetime

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)

and then tried to call the get_world_size() function:

num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

Full code:

train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(
            backend='nccl',
            init_method='env://',
            timeout=datetime.timedelta(0, 1800),
            world_size=0,
            rank=0,
            store=None,
            group_name='',
        )
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)

Dan*_*ion 13

I solved this problem by referring to https://github.com/NVIDIA/apex/issues/99. Specifically, run the script through the distributed launcher:

python -m torch.distributed.launch xxx.py
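
For context, the launcher helps because the env:// init method reads its rendezvous settings from the environment variables MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, and the launcher sets them before starting each process. If you only want a single-process debug run without the launcher, something like the following may work. It is a minimal sketch: the helper names `ensure_rendezvous_env` and `init_single_process`, the default address and port, and the choice of the gloo backend are all assumptions made for illustration, not part of PyTorch or of the original question.

```python
import os

# Variables the env:// init method looks up. RANK and WORLD_SIZE may
# alternatively be passed as arguments to init_process_group.
RENDEZVOUS_DEFAULTS = {
    "MASTER_ADDR": "127.0.0.1",  # assumption: everything runs on one machine
    "MASTER_PORT": "29500",      # assumption: this port is free
    "RANK": "0",
    "WORLD_SIZE": "1",
}


def ensure_rendezvous_env(env=None):
    """Fill in missing rendezvous variables and return the names that were added."""
    env = os.environ if env is None else env
    added = []
    for name, default in RENDEZVOUS_DEFAULTS.items():
        if not env.get(name):
            env[name] = default
            added.append(name)
    return added


def init_single_process(backend="gloo"):
    """Initialize a one-process group. Requires PyTorch to be installed."""
    # Imported lazily so ensure_rendezvous_env can be used without torch.
    import torch.distributed as dist

    ensure_rendezvous_env()
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size()
```

Values already present in the environment are left untouched, so the same script should still behave normally when it is started through the launcher.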


小智 7

Just an update: instead of running

$ python -m torch.distributed.launch --use_env train_script.py

you now only need to run:

$ torchrun train_script.py

as shown here.
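
One thing to watch when switching to torchrun (or to --use_env): the local rank is passed through the LOCAL_RANK environment variable rather than a --local_rank command-line argument, so a script that checks `local_rank != -1`, as the question's code does, has to read it from the environment. A minimal sketch follows. The helper name `get_local_rank` is hypothetical, and using -1 to mean "not launched in distributed mode" simply follows the convention in the question's code.

```python
import os


def get_local_rank(env=None):
    """Return the local rank set by the launcher, or -1 if it is not set."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", -1))


local_rank = get_local_rank()
if local_rank != -1:
    print(f"distributed run, local rank {local_rank}")
else:
    print("single-process run")
```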