Initializing std::vector outside main() causes performance degradation (multithreading)

Jac*_*ack 3 c++ performance multithreading vector c++11

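I'm writing a path tracer as a programming exercise. Yesterday I finally decided to implement multithreading, and it worked well. However, once I wrapped the test code I had written inside main() in a separate renderer class, I noticed a significant and consistent performance drop. In short, it seems that filling a std::vector anywhere outside of main() makes the threads that use its elements perform worse. I managed to isolate and reproduce the issue with simplified code, but unfortunately I still don't know why it happens or what to do to fix it.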

The performance drop is clearly noticeable and consistent:

  97 samples - time = 28.154226s, per sample = 0.290250s, per sample/th = 1.741498
  99 samples - time = 28.360723s, per sample = 0.286472s, per sample/th = 1.718832
 100 samples - time = 29.335468s, per sample = 0.293355s, per sample/th = 1.760128

vs.

  98 samples - time = 30.197734s, per sample = 0.308140s, per sample/th = 1.848841
  99 samples - time = 30.534240s, per sample = 0.308427s, per sample/th = 1.850560
 100 samples - time = 30.786519s, per sample = 0.307865s, per sample/th = 1.847191

The code I originally posted in this question can be found here: https://github.com/Jacajack/rt/tree/mt_debug or in the edit history.

I created a struct foo that mimics the behavior of my renderer class and is responsible for initializing the path tracing contexts in its constructor. Interestingly, when I remove the body of foo's constructor and do this instead (initializing contexts directly from main()):

std::vector<rt::path_tracer> contexts; // Can be on stack or on heap, doesn't matter
foo F(cam, scene, bvh, width, height, render_threads, contexts); // no longer fills `contexts`

contexts.reserve(render_threads);
for (int i = 0; i < render_threads; i++)
    contexts.emplace_back(cam, scene, bvh, width, height, 1000 + i);

F.run(render_threads);

the performance returns to normal. However, if I wrap those three lines in a separate function and call it from here, the performance gets worse again. The only pattern I can see is that filling the contexts vector outside of main() causes the problem.

I initially thought this was an alignment/cache issue, so I tried aligning the path_tracers with Boost's aligned_allocator and TBB's cache_aligned_allocator, with no result. It turns out that the problem persists even when only one thread is running. I suspect it must be some kind of wild compiler optimization (I'm using -O3), although that is just a guess. Do you know of any possible causes of this behavior, and what can be done to avoid it?

This happens with both gcc 10.1.0 and clang 10.0.0. Currently I'm only using -O3.

I managed to reproduce a similar problem in a standalone example, with the following results:

For N = 100000000, thread_count = 6

In main():
 196 samples - time = 26.789526s, per sample = 0.136681s, per sample/th = 0.820088
 197 samples - time = 27.045646s, per sample = 0.137288s, per sample/th = 0.823725
 200 samples - time = 27.312159s, per sample = 0.136561s, per sample/th = 0.819365


vs.
In foo::foo():
 193 samples - time = 22.690566s, per sample = 0.117568s, per sample/th = 0.705406
 196 samples - time = 22.972403s, per sample = 0.117206s, per sample/th = 0.703237
 198 samples - time = 23.257542s, per sample = 0.117462s, per sample/th = 0.704774
 200 samples - time = 23.540432s, per sample = 0.117702s, per sample/th = 0.706213


The results seem to be the opposite of what happens in my path tracer, but a visible difference is still there.

Thanks

Max*_*kin 5

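There is a race condition on foo::buf: one thread stores into it and another thread reads it. This is undefined behavior, but on the x86-64 platform it is harmless in this particular code.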
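
One way to remove a race of this kind is sketched below, under assumptions: the code around foo::buf is not shown here, so shared_buf, publish_and_read and the float* payload are hypothetical stand-ins rather than the question's actual types. The writer publishes the pointer with a release store and the reader picks it up with an acquire load, which makes the hand-off well defined on every platform instead of merely harmless on x86-64.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for a member like foo::buf that one thread
// writes and another thread reads.
struct shared_buf {
    std::atomic<float*> ptr{nullptr};
};

// Publishes a buffer from the calling thread and reads one element
// from a second thread once the pointer becomes visible.
float publish_and_read() {
    static float storage[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    shared_buf s;
    float seen = 0.0f;

    std::thread reader([&s, &seen] {
        float* p = nullptr;
        // The acquire load pairs with the release store below.
        while ((p = s.ptr.load(std::memory_order_acquire)) == nullptr)
            std::this_thread::yield();
        seen = p[2];
    });

    s.ptr.store(storage, std::memory_order_release);
    reader.join();  // join() makes `seen` safe to read here
    return seen;
}
```

If the buffer is only ever written before the worker threads are started, no atomic is needed at all, because starting a std::thread already orders the earlier writes before the thread's body.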


I cannot reproduce your observations on an Intel i9-9900KS; both variants print the same per sample statistics.

Compiled with gcc-8.4: g++ -o release/gcc/test.o -c -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG test.cc

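With int N = 50000000; each thread operates on its own float[N] array, which occupies 200MB. A data set of that size does not fit in the CPU caches, so the program incurs a lot of data cache misses because it has to fetch the data from memory: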

$ perf stat -ddd ./release/gcc/test
[...]
      71474.813087      task-clock (msec)         #    6.860 CPUs utilized          
                66      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           341,942      page-faults               #    0.005 M/sec                  
   357,027,759,875      cycles                    #    4.995 GHz                      (30.76%)
   991,950,515,582      instructions              #    2.78  insn per cycle           (38.43%)
   105,609,126,987      branches                  # 1477.571 M/sec                    (38.40%)
       155,426,137      branch-misses             #    0.15% of all branches          (38.39%)
   150,832,846,580      L1-dcache-loads           # 2110.294 M/sec                    (38.41%)
     4,945,287,289      L1-dcache-load-misses     #    3.28% of all L1-dcache hits    (38.44%)
     1,787,635,257      LLC-loads                 #   25.011 M/sec                    (30.79%)
     1,103,347,596      LLC-load-misses           #   61.72% of all LL-cache hits     (30.81%)
   <not supported>      L1-icache-loads                                             
         7,457,756      L1-icache-load-misses                                         (30.80%)
   150,527,469,899      dTLB-loads                # 2106.021 M/sec                    (30.80%)
        54,966,843      dTLB-load-misses          #    0.04% of all dTLB cache hits   (30.80%)
            26,956      iTLB-loads                #    0.377 K/sec                    (30.80%)
           415,128      iTLB-load-misses          # 1540.02% of all iTLB cache hits   (30.79%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      10.419122076 seconds time elapsed

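
The 200MB figure above follows from simple arithmetic, sketched here for reference. The helper names buffer_bytes and total_bytes are hypothetical, and the sketch assumes a 4-byte float, which holds on x86-64.

```cpp
#include <cstddef>

// Bytes occupied by one thread's private buffer of `n` floats.
constexpr std::size_t buffer_bytes(std::size_t n) {
    return n * sizeof(float);
}

// Bytes occupied by all private buffers when `threads` workers run.
constexpr std::size_t total_bytes(std::size_t n, std::size_t threads) {
    return buffer_bytes(n) * threads;
}

// Assumption of this sketch: float is 4 bytes wide.
static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// 50,000,000 floats * 4 bytes = 200,000,000 bytes, i.e. 200MB per thread.
static_assert(buffer_bytes(50000000) == 200000000, "200MB per thread");
```

With the six threads used in the question, total_bytes(50000000, 6) comes to 1.2GB, which is far larger than any last-level cache.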

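If you run this application on a NUMA system, e.g. AMD Ryzen or Intel Xeon with multiple sockets, then your observations can probably be explained by unfavorable placement of threads on NUMA nodes that are remote from the node where foo::buf is allocated. Those last-level data cache misses have to read memory, and if that memory is on a remote NUMA node, the read takes longer.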

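To fix that, you may like to allocate the memory in the thread that uses it (rather than in the main thread, as the code does) and use a NUMA-aware allocator, such as TCMalloc. See "NUMA aware heap memory manager" for more details.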
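
A minimal sketch of the first suggestion, under assumptions: run_workers and its parameters are hypothetical names, not the question's code. Each worker allocates and initializes its own buffer, so with Linux's default local (first-touch) allocation policy the pages normally end up on the NUMA node that the worker is running on. The sketch neither pins threads nor swaps the allocator, so placement can still change if the scheduler migrates a thread.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Starts `thread_count` workers; each one allocates, fills and sums
// its own private buffer of `n` floats. Returns one sum per worker.
std::vector<double> run_workers(unsigned thread_count, std::size_t n) {
    std::vector<double> sums(thread_count, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&sums, t, n] {
            // Allocated and first touched inside the worker thread,
            // not in the main thread.
            std::vector<float> buf(n, 1.0f);
            // Each worker writes only its own element of `sums`.
            sums[t] = std::accumulate(buf.begin(), buf.end(), 0.0);
        });
    }

    for (auto& w : workers)
        w.join();
    return sums;
}
```

A call such as run_workers(6, 50000000) would mirror the thread count and buffer size discussed above.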


When running benchmarks, you may like to fix the CPU frequency so that it is not adjusted dynamically during the run. On Linux you can do that with sudo cpupower frequency-set --related --governor performance.