I am unable to initialize the process group in PyTorch for a BERT model. I tried to initialize it with the following code:
import torch
import datetime
torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)
and then tried to call get_world_size():
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
Full code:
train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(0, 1800), world_size=0, rank=0, store=None, group_name='')
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)
Dan*_*ion 13
I solved this problem by following https://github.com/NVIDIA/apex/issues/99. Specifically, run:
python -m torch.distributed.launch xxx.py
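For context, here is a minimal sketch of why the launcher helps, assuming the script relies on init_method='env://'. In that mode init_process_group reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment unless rank and world size are passed explicitly, and the launcher sets those variables for each process it starts. The helper names missing_dist_env and init_from_env are hypothetical and not part of PyTorch.

```python
import os

# Variables that env:// initialization expects when rank and world_size
# are not passed to init_process_group directly.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_dist_env(env=None):
    """Return the required variables that are absent from env (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


def init_from_env():
    """Initialize the default process group from launcher-provided variables."""
    missing = missing_dist_env()
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Start the script through the distributed launcher."
        )
    # Imported here so the check above works even without PyTorch installed.
    import torch.distributed as dist

    # The nccl backend assumes CUDA GPUs are available on this machine.
    dist.init_process_group(backend="nccl", init_method="env://")
    return dist.get_world_size()
```

Started with plain python xxx.py, init_from_env() would raise and list the missing variables; started through the launcher, the variables are present and get_world_size() reports the number of processes in the group.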
小智 7
Just an update. Instead of running:
$ python -m torch.distributed.launch --use_env train_script.py
you now only need to run:
$ torchrun train_script.py
As shown here.
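One practical consequence, sketched under assumptions: both torchrun and the --use_env form of the older launcher expose the per-node process index through the LOCAL_RANK environment variable rather than a --local_rank command-line argument, so the training script has to read it from the environment. The helper name get_local_rank is hypothetical, and the -1 default simply mirrors the local_rank != -1 check in the question.

```python
import os


def get_local_rank(env=None, default=-1):
    """Read LOCAL_RANK from the environment (hypothetical helper).

    Returns default when the variable is absent, i.e. when the script
    was not started by a distributed launcher.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", default))


local_rank = get_local_rank()
if local_rank != -1:
    # Only a launcher-started process reaches this branch; this is where
    # the process group would be initialized in the question's script.
    print(f"distributed run, local rank {local_rank}")
```

With this in place the same script runs unchanged as a single process (local_rank stays -1) or under torchrun.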