blu*_*lue 8 c++ lambda multithreading gcc c++11
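The following snippet takes a command-line argument that gives the number of threads to spawn; each thread runs a simple for loop, and all of them run concurrently.
If the argument passed is 0, no std::thread is spawned.
With gcc 4.9.2, ./snippet 0 takes about 10% longer than ./snippet 1 on average, i.e. the version that spawns one std::thread to execute the loop is faster than the version that just executes the loop in main.
Does anyone know what is going on? clang-4 does not show this behaviour at all (the version with one std::thread is slower), and with gcc 6.2 the version with one std::thread runs only slightly faster (taking the minimum time over ten trials as the measurement).
Here is the snippet. ScopedNanoTimer is just a simple RAII timer. I am compiling with -g -O3 -pthread -std=c++11.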
#include <cstdlib>  // std::atoi
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    // setup
    if (argc < 2)
        return 1;
    const unsigned n_threads = std::atoi(argv[1]);
    // split the total work evenly; with 0 threads, main runs all iterations
    const auto n_iterations = 1000000000ul / (n_threads > 0u ? n_threads : n_threads + 1);
    // define workload
    auto task = [n_iterations]() {
        volatile auto sum = 0ul;  // volatile keeps the loop from being optimized away
        for (auto i = 0ul; i < n_iterations; ++i) ++sum;
    };
    // run the workload, either directly in main or in n_threads threads
    if (n_threads == 0) {
        task();
    } else {
        std::vector<std::thread> threads;
        for (auto i = 0u; i < n_threads; ++i) threads.emplace_back(task);
        for (auto &thread : threads) thread.join();
    }
    return 0;
}
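Edit
Following a suggestion in the comments, I tried to hide from the compiler the fact that the number of iterations is known a priori in the n_threads == 0 branch. I changed the relevant line to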
const auto n_iterations = 1000000000ul / (n_threads > 0u ? n_threads : n_threads + 1);
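I also removed the outer for loop that ran the workload 10 times, as well as all uses of ScopedNanoTimer. These changes are now reflected in the snippet above.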
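I compiled with the flags above and ran the program several times on a workstation running Debian Linux, kernel version 3.16.39-1+deb8u2, with a quad-core Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz. All other programs were closed, CPU throttling / Intel SpeedStep / Turbo Boost were switched off, and the CPU governor policy was set to "performance". The internet connection was switched off.
The trend is always the same: compiled with gcc-4.9.2, the version without std::threads is about 10% slower than the version that spawns one thread. clang-4, by contrast, shows the opposite (expected) behaviour.
The following measurements convinced me that the issue is a suboptimal optimization by gcc-4.9.2 and is related neither to context switches nor to poor measurement quality. That said, not even godbolt's Compiler Explorer shows me clearly what gcc is doing, so I do not consider the question answered.
Timing + context switch measurements with g++-4.9.2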
~$ g++ -std=c++11 -pthread -g -O3 snippet.cpp -o snippet_gcc
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_gcc 0 | egrep '((wall clock)|(switch))'; done
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 7
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.90
Voluntary context switches: 1
Involuntary context switches: 3
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 2
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.08
Voluntary context switches: 1
Involuntary context switches: 4
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_gcc 1 | egrep '((wall clock)|(switch))'; done
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.79
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.95
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.81
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Voluntary context switches: 2
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.97
Voluntary context switches: 2
Involuntary context switches: 3
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.85
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Voluntary context switches: 2
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.95
Voluntary context switches: 2
Involuntary context switches: 5
Timing + context switch measurements with clang++-4.0
~$ clang++ -std=c++11 -pthread -g -O3 snippet.cpp -o snippet_clang
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_clang 0 | egrep '((wall clock)|(switch))'; done
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 7
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 3
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 3
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 1
Involuntary context switches: 4
~$ for i in $(seq 1 10); do /usr/bin/time -v 2>&1 ./snippet_clang 1 | egrep '((wall clock)|(switch))'; done
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 6
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 5
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 2
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 3
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 4
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.67
Voluntary context switches: 2
Involuntary context switches: 7
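I think you may be the victim of a poor test sample. I tried to reproduce this behaviour, and after running each option about 10 times I found that the times I was getting had relatively high variance. I ran some more tests using /usr/bin/time -v and found that the execution time of the program correlates closely with the number of involuntary context switches it experienced.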
Option 0: No threads
time, context switches
20.32, 1806
20.09, 2139
21.01, 1916
21.13, 1873
21.15, 1847
18.67, 1617
19.06, 1692
17.94, 1546
21.40, 1867
18.64, 1629
Option 1: Threads
time, context switches
19.68, 1750
19.60, 1740
19.35, 1783
19.60, 1726
19.95, 1823
20.42, 1800
19.54, 1745
19.40, 1699
19.36, 1703
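I think you may simply have run your benchmark during a period of varying workload on the operating system. As you can see in the timing data above, the times above 20 seconds were all collected during periods of high OS load. Likewise, the times below 19 seconds were collected during periods of low load.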
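Logically, I can see why the loop that dispatches threads should run slower. The overhead of creating a thread is high relative to the operation in the loop, which merely increments a number. This should increase the user time needed to run the program. The catch is that this increase in user time is probably negligible compared with the execution time of the whole loop. You are only creating 10 additional threads over the lifetime of the program, and performing the computation inside those threads should make little, if any, difference compared with performing it in the main thread. Over the course of the whole program you perform billions of other operations, which hides the increase in user time. If you really want to benchmark thread creation time, you could write a program that creates a large number of threads and does nothing else. You should also take care to run such benchmarks in an environment with as few background processes as possible.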
This may not be the whole story, but I believe it is still worth considering.