Why is concurrency with std::async faster than with std::thread?

Question

I am reading Chapter 8, on concurrency, of "Modern C++ Programming Cookbook, 2nd edition" and stumbled upon something that puzzles me.

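The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are very close; for example, the heart of the parallel_map functions is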

// parallel_map using std::async
...
tasks.emplace_back(std::async(
  std::launch::async,
  [=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...

// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...

The complete code can be found here for std::thread and there for std::async.
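
For readers without the book at hand, the chunking idea behind the std::async variant can be sketched in a self-contained form. This is not the book's code: the name parallel_map_sketch, the split into one chunk per hardware thread, and passing the functor by value are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <future>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical helper (not the book's parallel_map): applies f in place to
// [first, last), using at most one std::async task per hardware thread.
template <typename Iter, typename F>
void parallel_map_sketch(Iter first, Iter last, F f) {
    using Diff = typename std::iterator_traits<Iter>::difference_type;
    const Diff size = std::distance(first, last);
    // hardware_concurrency() may return 0, so fall back to a single task.
    const Diff task_count =
        std::max<Diff>(1, static_cast<Diff>(std::thread::hardware_concurrency()));
    const Diff chunk = (size + task_count - 1) / task_count;  // rounded up

    std::vector<std::future<void>> tasks;
    for (Diff remaining = size; remaining > 0;) {
        const Diff step = std::min(chunk, remaining);
        const Iter chunk_end = std::next(first, step);
        // Each task transforms its own disjoint sub-range in place.
        tasks.emplace_back(std::async(std::launch::async, [first, chunk_end, &f] {
            std::transform(first, chunk_end, first, f);
        }));
        first = chunk_end;
        remaining -= step;
    }
    // get() blocks until the task finishes and rethrows any exception from f.
    for (auto& task : tasks) task.get();
}
```

A std::thread version of the same sketch would differ only in storing std::thread objects and calling join() instead of get(), which is why the difference in timings is surprising.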

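What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author treats this as self-evident and offers no hint of a justification: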

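If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.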

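I ran the code above on my computer, and although the differences are not as striking as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author later also brings in the standard implementations of these algorithms, which are faster still.) On my computer, the code runs with four threads, which matches the number of physical cores of my CPU.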

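Maybe I am missing something, but why is it obvious that std::async should run faster than std::thread in this example? My intuition was that std::async, being a higher-level abstraction over threads, should take at least as long as plain threads, if not longer. Evidently I was wrong. Are these findings consistent, as the book suggests, and what is the explanation?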

Answer 1

asy*_*nts 21

My original explanation was incorrect. See @OznOg's answer below.

Revised answer:

I created a simple benchmark that uses std::async and std::thread to run some tiny tasks:

#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>
#include <optional>  // std::optional is used in both benchmark functions

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };

    for (size_t i = 0; i < count; ++i)
        threads[i]->join();

    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);

    for (size_t i = 0; i < count; ++i)
        results[i]->wait();

    results.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << "  threads?     " << std::boolalpha << CONFIG_THREADS << '\n'
              << "  iterations   " << CONFIG_ITERATIONS << '\n'
              << "  async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}

I ran the benchmark as follows:

$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::async
Completed in 293539858ns (293ms)

I re-ran all the benchmarks with strace attached and tallied the system calls:

# std::async with std::launch::async
      1 access
      2 arch_prctl
     36 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
  10002 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::async with std::launch::deferred
      1 access
      2 arch_prctl
     11 brk
      6 close
      1 execve
      1 exit_group
  10002 futex
     28 mmap
      9 mprotect
      2 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
      1 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::async
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::deferred
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

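We observe that std::async is significantly faster with std::launch::deferred, but everything else does not seem to matter much.

My conclusions are:

The current libstdc++ implementation does not take advantage of the fact that std::async does not need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread does not do.
std::async with std::launch::deferred saves the setup and teardown costs and is much faster in this case.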
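
The absence of clone calls in the deferred run fits the standard's definition of std::launch::deferred: the task is evaluated lazily on whichever thread first waits on the future, so no thread is created at all. A minimal sketch to check this by comparing thread IDs (the helper name runs_on_calling_thread is made up for illustration):

```cpp
#include <future>
#include <thread>

// Hypothetical helper: returns true if a task launched with the given policy
// was executed on the thread that called this function.
bool runs_on_calling_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> result =
        std::async(policy, [] { return std::this_thread::get_id(); });
    // With std::launch::deferred the task only runs here, inside get().
    return result.get() == caller;
}
```

Calling it with std::launch::deferred should report the caller's own thread, while std::launch::async should report a different one. In other words, the deferred benchmark measures sequential execution rather than concurrency.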

My machine is configured as follows:

$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture:                    x86_64
Byte Order:                      Little Endian
CPU(s):                          8
Model name:                      Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Original answer:

std::thread is a wrapper around the thread objects provided by the operating system, which are extremely expensive to create and destroy.

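std::async is similar, but there is no one-to-one mapping between tasks and operating system threads. It could be implemented with a thread pool, where threads are reused for multiple tasks.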

So std::async is the better choice if you have many small tasks, and std::thread is the better choice if you have a few long-running tasks.

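Also, if you have things that truly need to happen in parallel, std::async might not be a good fit. (std::thread cannot make that guarantee either, but it is the closest you can get.)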

To clarify, in your case std::async could save the overhead of creating and destroying threads.

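(Depending on the operating system, you can also lose performance simply by having a lot of threads running. An operating system may use a scheduling strategy that tries to guarantee every thread runs every so often, so the scheduler may give each thread a smaller slice of processing time, which creates more overhead from switching between threads.)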
