Distributed training: torchrun command not found. Do I need to install it separately?

Cha*_*ker 4 distributed-computing pytorch

I'm following the torchrun tutorial, but it never says how to install torchrun. Since I have PyTorch, should it already be there automatically? Or what is going on?

Output:

(meta_learning_a100) [miranda9@hal-dgx ~]$ torchrun --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py
bash: torchrun: command not found...

I'm asking because the official PyTorch page seems to recommend using it: https://pytorch.org/docs/stable/elastic/run.html

For example:

TORCHRUN (ELASTIC LAUNCH)
torchrun provides a superset of the functionality of torch.distributed.launch with the following additional functionalities:

Worker failures are handled gracefully by restarting all workers.

Worker RANK and WORLD_SIZE are assigned automatically.

Number of nodes is allowed to change between minimum and maximum sizes (elasticity).

Transitioning from torch.distributed.launch to torchrun
torchrun supports the same arguments as torch.distributed.launch except for --use_env which is now deprecated. To migrate from torch.distributed.launch to torchrun follow these steps:

If your training script is already reading local_rank from the LOCAL_RANK environment variable, then you simply need to omit the --use_env flag, e.g.:

torch.distributed.launch:

$ python -m torch.distributed.launch --use_env train_script.py

torchrun:

$ torchrun train_script.py
If your training script reads the local rank from a --local_rank cmd argument, change your training script to read from the LOCAL_RANK environment variable as demonstrated by the following code snippet:

torch.distributed.launch:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
local_rank = args.local_rank

torchrun:

import os
local_rank = int(os.environ["LOCAL_RANK"])
The aforementioned changes suffice to migrate from torch.distributed.launch to torchrun. To take advantage of new features such as elasticity, fault-tolerance, and error reporting of torchrun, please refer to:

Train script for more information on authoring training scripts that are torchrun compliant.

the rest of this page for more information on the features of torchrun.
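
(As an aside, not part of the docs excerpt: a torchrun-compliant script simply reads LOCAL_RANK from the environment and initializes the process group itself. A minimal sketch, assuming one GPU per worker and the NCCL backend; the model and training loop below are just placeholders.)

# train_script.py - a minimal sketch of a torchrun-compliant training script.
# torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE in the environment, so no
# --local_rank argument parsing is needed.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE come from the env
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).to(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 10, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()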

Usage
Single-node multi-worker

>>> torchrun
    --standalone
    --nnodes=1
    --nproc_per_node=$NUM_TRAINERS
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Fault tolerant (fixed sized number of workers, no elasticity):

>>> torchrun
    --nnodes=$NUM_NODES
    --nproc_per_node=$NUM_TRAINERS
    --rdzv_id=$JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$HOST_NODE_ADDR
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
HOST_NODE_ADDR, in form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node that has high bandwidth.
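
(Another aside, not part of the docs excerpt: the rank, world size, and rendezvous settings above are handed to each worker as environment variables, so a tiny probe script is a quick way to see what a given torchrun invocation actually provides. The file name probe.py is just an example.)

# probe.py - a small sketch that prints the per-worker environment torchrun sets.
# Launch with, e.g.: torchrun --standalone --nnodes=1 --nproc_per_node=2 probe.py
import os

for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE",
            "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var)}")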

Cross-posted:

Ber*_*ger 5

To be more precise, it was added in PyTorch version 1.10.0, so yes, if you have PyTorch installed it should be there automatically. https://github.com/pytorch/pytorch/releases/tag/v1.10.0

If you are on an older version, you can use

python -m torch.distributed.run

instead of

torchrun

It is just an entry point. https://github.com/pytorch/pytorch/pull/64049
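
If you want to verify this from Python before launching anything, a quick sanity check along these lines works (a sketch; it only assumes the 1.10.0 cutoff mentioned above):

# Sanity check (a sketch): the torchrun console script is an entry point for
# torch.distributed.run, so that module should be importable even on releases
# that predate the standalone command.
import importlib.util

import torch

print(torch.__version__)  # the torchrun command ships with pytorch >= 1.10.0
print(importlib.util.find_spec("torch.distributed.run") is not None)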