How does torch.distributed.barrier() work

Question

hlu*_*hlu 8 pytorch huggingface-transformers

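I've read all the documentation I could find about torch.distributed.barrier(), but I still can't understand how it is being used in this script, and would really appreciate some help.

The official doc of torch.distributed.barrier says it "Synchronizes all processes. This collective blocks processes until the whole group enters this function, if async_op is False, or if async work handle is called on wait()."

It's used in two places in the script. First place: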

    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

        ... (preprocesses the data and save the preprocessed data)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()

Second place:

    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

        ... (loads the model and the vocabulary)

    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

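I'm having trouble relating the comments in the code to the behavior of this function as stated in the official doc. How does it make sure that only the first process executes the code between the two calls of torch.distributed.barrier(), and why does it only check whether the local rank is 0 before the second call?

Thanks in advance!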

Answer 1

Bra*_*roy 13

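First you need to understand ranks. In brief: in a multiprocessing context we typically assume that rank 0 is the first process, or base process. The other processes are then numbered from there, e.g. 1, 2, 3, for a total of four processes.

Some operations don't need to be done in parallel, or you only need one process to do some preprocessing or caching so that the other processes can use that data.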

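In your example, the first if statement is entered only by the non-base processes (ranks 1, 2, 3). They block (or "wait") there because they run into the barrier: barrier() blocks until all processes have reached a barrier, and the base process hasn't reached one yet.

So at this point the non-base processes (1, 2, 3) are blocked, but the base process (0) continues. The base process does some work (in this case, preprocessing and caching the data) until it reaches the second if statement. There, the base process runs into a barrier. Now all processes have stopped at a barrier, which means the barrier can be lifted and all processes can continue. The non-base processes then run the code between the two if statements as well, but because the base process has already prepared the data, they can use the cache (as the comment in the script says) instead of redoing the work.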
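The sequence described above can be reproduced without GPUs or PyTorch. The following is only a sketch of the same control flow: it uses `threading.Barrier` from the Python standard library as a stand-in for `torch.distributed.barrier()`, threads in place of processes, and a dictionary in place of the cached dataset. The names `simulate`, `world_size` and `cache` are made up for this illustration and do not come from the script in the question.

```python
import threading
import time


def simulate(world_size=4):
    """Run `world_size` threads through the two-barrier pattern.

    Returns the list of (rank, action) events in the order they happened.
    """
    barrier = threading.Barrier(world_size)  # stand-in for torch.distributed.barrier()
    cache = {}    # stand-in for the preprocessed dataset saved to disk
    events = []   # ordered log of what each rank did
    lock = threading.Lock()

    def worker(rank):
        if rank != 0:
            barrier.wait()  # ranks 1..N-1 block here until rank 0 also waits

        # Every rank runs this section, but rank 0 gets here first.
        if "dataset" not in cache:
            time.sleep(0.1)  # pretend preprocessing is slow
            cache["dataset"] = [1, 2, 3]
            action = "built cache"
        else:
            action = "loaded cache"
        with lock:
            events.append((rank, action))

        if rank == 0:
            barrier.wait()  # last arrival: this releases the waiting ranks

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Calling `simulate()` always yields rank 0 building the cache as the first event, followed by the other ranks loading it in some unspecified order. Each thread calls `wait()` exactly once, which is why the barrier releases: the count of arrivals matters, not which if statement the call sits in.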

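Perhaps the most important things to understand are:

- when a process encounters a barrier, it blocks
- the position of the barrier is not important (for instance, not all processes have to enter the same if statement)
- a process is blocked by a barrier until all processes have encountered a barrier, at which point the barrier is lifted for all processes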
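Because the two if statements always come as a pair, they can be wrapped in a context manager. This is a sketch, not something the script in the question does: `local_rank_zero_first` is a hypothetical helper name, and the `barrier` parameter is an assumption added so that the control flow can be exercised without an initialized process group. When it is left as `None`, the helper falls back to `torch.distributed.barrier`, which requires PyTorch and a prior call to `torch.distributed.init_process_group`.

```python
from contextlib import contextmanager


@contextmanager
def local_rank_zero_first(local_rank, barrier=None):
    """Let local rank 0 run the body first; other ranks run it afterwards.

    `local_rank == -1` means "not distributed", as in the script above.
    `barrier` is a hypothetical hook; it defaults to torch.distributed.barrier.
    """
    if local_rank == -1:
        yield  # single process: nothing to synchronize
        return

    if barrier is None:
        import torch.distributed as dist  # imported lazily; needs PyTorch
        barrier = dist.barrier

    if local_rank != 0:
        barrier()  # non-zero ranks wait until rank 0 reaches its barrier
    yield          # the body: preprocess-and-cache, or load from cache
    if local_rank == 0:
        barrier()  # rank 0 has finished the body; release the other ranks
```

With this helper, the dataset section could be written as `with local_rank_zero_first(args.local_rank): ...` around the preprocessing code. Every rank still calls the barrier exactly once per `with` block, which is the property the three points above rely on.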

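Thanks for the clarification: "a process is blocked by a barrier until all processes have encountered a barrier, at which point the barrier is lifted for all processes" makes sense. (3 upvotes)
@QinshengZhang The processes don't have to enter the same barrier, just *a* barrier. So if you write an if statement that only some processes enter, they will only continue once the other processes have reached another barrier. (2 upvotes)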
