Should the cache padding size of x86-64 be 128 bytes?

Question

Qua*_*Cat 15 c++ x86-64 rust cpu-cache false-sharing

I found a comment in crossbeam:

Starting from Intel's Sandy Bridge, the spatial prefetcher now pulls pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.

Sources:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

https://github.com/facebook/folly/blob/1b5288e6eea6df074758f877c849b6e73bbb9fbb/folly/lang/Align.h#L107

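I could not find a statement like that in Intel's manual. But folly still uses 128-byte padding as of its latest commit, which I find convincing. So I wrote some code to see whether I could observe this behavior. Here is my code.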

#include <thread>

int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

Compiler Explorer

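My CPU is a Ryzen 3700X. With indexes 0, 1, 2, 3, the program takes about 1.2 s to finish. With indexes 0, 16, 32, 48, it takes about 200 ms. With indexes 0, 32, 64, 96, it also takes about 200 ms, exactly the same as before. I also ran these tests on an Intel machine and got similar results.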
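To make those index choices concrete, here is a small sketch of the arithmetic. It assumes 4-byte `int`, 64-byte cache lines, and an array that starts on a 128-byte boundary (the code above does not force that alignment). The helper names `line_of` and `pair_of` are made up for illustration.

```cpp
#include <cstddef>

// Assumptions for illustration only: 64-byte lines, 128-byte aligned pairs,
// and an int array whose first element sits on a 128-byte boundary.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kPairBytes = 128;
static_assert(sizeof(int) == 4, "the index arithmetic below assumes 4-byte int");

// Which 64-byte cache line the element at `idx` falls in.
constexpr std::size_t line_of(std::size_t idx) {
    return idx * sizeof(int) / kLineBytes;
}

// Which 128-byte aligned pair of lines the element at `idx` falls in.
constexpr std::size_t pair_of(std::size_t idx) {
    return idx * sizeof(int) / kPairBytes;
}

// Indexes 0..3 share one line: real false sharing.
static_assert(line_of(0) == line_of(3), "same 64-byte line");
// Indexes 0 and 16 are in different lines but the same 128-byte pair.
static_assert(line_of(0) != line_of(16) && pair_of(0) == pair_of(16), "adjacent lines");
// Indexes 0 and 32 are in different 128-byte pairs.
static_assert(pair_of(0) != pair_of(32), "separate pairs");
```

So the three runs correspond to "same line", "adjacent lines in one pair", and "separate pairs", under the stated alignment assumption.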

From this micro-benchmark I can't see a reason to use 128-byte padding instead of 64-byte padding. Did I get something wrong?

Answer 1

Pet*_*des 18

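Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, if there is spare memory bandwidth (off-core request tracking slots) at the time the first line is being pulled in.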

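Your microbenchmark doesn't show any significant time difference between 64-byte and 128-byte separation. Without any actual false sharing (within the same 64-byte line), after some initial chaos it quickly reaches a state where each core has exclusive ownership of the cache line it is modifying. That means no further L1d misses, and therefore no requests to L2 that would trigger the L2 spatial prefetcher.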

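Compare that with, for example, two pairs of threads contending over separate atomic<int> variables in adjacent (or non-adjacent) cache lines, or false-sharing with them. There, L2 spatial prefetching can couple the contention together, so all 4 threads contend with each other instead of forming 2 independent pairs. Basically, in any case where cache lines really are bouncing back and forth between cores, L2 spatial prefetching can make things worse if you aren't careful.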
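A minimal sketch of that scenario, not taken from the answer: two threads hammer one atomic counter while two other threads hammer a second one, with the byte distance between the counters as a template parameter. The names `TwoCounters` and `run_pairs` are hypothetical. No timing result is claimed here; the function would need to be timed or profiled with perf at separations 64 and 128 to see whether the pairs interfere on a given CPU.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical layout for illustration: counter `b` starts `Separation` bytes
// after counter `a`. 64 puts them in adjacent lines of one 128-byte pair,
// 128 puts them in different pairs.
template <std::size_t Separation>
struct alignas(128) TwoCounters {
    std::atomic<int> a{0};
    char pad[Separation - sizeof(std::atomic<int>)];
    std::atomic<int> b{0};
};

// Threads t1/t2 contend on `a`, threads t3/t4 contend on `b`.
// Returns the sum of both counters, which should be 4 * iters.
template <std::size_t Separation>
int run_pairs(int iters) {
    TwoCounters<Separation> c;
    auto hit = [iters](std::atomic<int>& x) {
        for (int i = 0; i < iters; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(hit, std::ref(c.a)), t2(hit, std::ref(c.a));
    std::thread t3(hit, std::ref(c.b)), t4(hit, std::ref(c.b));
    t1.join(); t2.join(); t3.join(); t4.join();
    return c.a.load() + c.b.load();
}
```

Calling `run_pairs<64>(n)` and `run_pairs<128>(n)` from a small `main` and comparing them under `perf stat` is one way to look for the coupling described above.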

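(The L2 prefetcher doesn't keep trying indefinitely to complete the line pair for every valid line it is caching; that would hurt cases like this one, where different cores repeatedly touch adjacent lines, more than it would help anything.)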

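Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it is supposed to demonstrate destructive interference at 64 bytes but not at 128. Unfortunately, the answer there doesn't mention L2 spatial prefetch as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache would, especially if it is an inclusive cache).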
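As a rough illustration of how these constants are typically applied, here is a sketch of a padding wrapper. The names `kDestructive`, `Padded`, and `Padded128` are hypothetical, and the fallback value of 64 is an assumption for standard libraries that don't provide the constant.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Use the standard constant when the library provides it; otherwise fall back
// to 64, a common x86-64 line size. The fallback value is an assumption.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64;
#endif

// Hypothetical wrapper: aligns T to Align bytes, which also rounds its size
// up to a multiple of Align, so neighbours in an array never share that span.
template <typename T, std::size_t Align = kDestructive>
struct alignas(Align) Padded {
    T value{};
};

// The more conservative 128-byte choice discussed in the question.
template <typename T>
using Padded128 = Padded<T, 128>;

static_assert(alignof(Padded128<std::atomic<int>>) == 128, "aligned to a line pair");
static_assert(sizeof(Padded128<std::atomic<int>>) == 128, "occupies a whole line pair");
```

An array such as `Padded128<std::atomic<int>> counters[4];` then gives each counter its own 128-byte aligned pair, at the cost of twice the memory of 64-byte padding.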

Perf counters reveal a difference even with your benchmark

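There is more initial chaos with adjacent lines, and we can measure it with performance counters on your benchmark. On my i7-6700k (quad-core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without the memory accesses being optimized away, and added a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the roughly 500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. The actual number is quite variable, from 200 to 850, but the effect on overall time is still negligible.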

Adjacent lines, 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                #    3.958 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               126      page-faults               #  224.752 /sec                     ( +-  0.35% )
     2,180,391,747      cycles                    #    3.889 GHz                      ( +-  0.12% )
     2,003,039,378      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,118,661      uops_issued.any           #    2.861 G/sec                    ( +-  0.00% )
     2,003,739,959      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
               494      machine_clears.memory_ordering #  881.172 /sec                     ( +-  9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs. 128-byte separation, only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                #    3.957 CPUs utilized            ( +-  0.13% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               124      page-faults               #  221.203 /sec                     ( +-  0.16% )
     2,180,048,243      cycles                    #    3.889 GHz                      ( +-  0.13% )
     2,003,038,553      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,084,990      uops_issued.any           #    2.862 G/sec                    ( +-  0.00% )
     2,003,707,895      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
                22      machine_clears.memory_ordering #   39.246 /sec                     ( +-  9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +-  0.24% )

Presumably this depends somewhat on how Linux schedules the threads onto logical cores on this 4c8t machine. Related:

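What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - actual false sharing within one line is much worse for logical cores that share a physical core, but there is probably no effect for adjacent lines: L2 is per physical core, so it is the same for both, and both lines will stay hot in L1d.
Why flush the pipeline for Memory Order Violation caused by other logical processors?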
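One way to take the scheduler out of the picture is to pin each benchmark thread to a chosen logical CPU. The following is a Linux/glibc-specific sketch that was not part of the original benchmark; the helper names `first_allowed_cpu` and `pin_current_thread` are made up for illustration. Which logical CPU numbers are hyper-siblings is machine-dependent and can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```cpp
#include <pthread.h>
#include <sched.h>

// Linux/glibc-specific sketch. Returns the lowest-numbered logical CPU the
// calling thread is currently allowed to run on, or -1 on failure.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}

// Restricts the calling thread to a single logical CPU.
// Returns true on success, false if the CPU number is invalid or not allowed.
inline bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Each benchmark thread could call `pin_current_thread(n)` as its first action, with `n` chosen so that the four threads land either on four separate physical cores or on two pairs of hyper-siblings, to compare the two placements.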

vs. actual false sharing within one line: 10M machine clears

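The store buffer (and store forwarding) gets a bunch of increments done for every false-sharing machine clear, so it's not as bad as one might expect. (It would be far worse with atomic RMWs, like std::atomic<int> fetch_add, since every single increment needs direct access to L1d cache as it executes.) See: Why does false sharing still affect non atomics, but much less than atomics?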
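To compare the two cases side by side, here is a hedged sketch with two counters deliberately placed in the same 64-byte line, incremented either with plain volatile loads and stores or with atomic RMWs. The struct and function names are hypothetical, and no timings are claimed; the two runners would need to be timed on the machine of interest.

```cpp
#include <atomic>
#include <thread>

// Both members share one 64-byte line, so two threads updating x and y
// false-share it. Each thread only touches its own member (no data race).
struct alignas(64) SharedLinePlain  { volatile int x = 0; volatile int y = 0; };
struct alignas(64) SharedLineAtomic { std::atomic<int> x{0}; std::atomic<int> y{0}; };

// Plain load + store: the store can sit in the store buffer and be forwarded
// to the next load, so several increments can complete per line transfer.
inline void bump_plain(volatile int& c, int iters) {
    for (int i = 0; i < iters; ++i) c = c + 1;
}

// Atomic read-modify-write: each increment needs the line in L1d as it executes.
inline void bump_atomic(std::atomic<int>& c, int iters) {
    for (int i = 0; i < iters; ++i) c.fetch_add(1, std::memory_order_relaxed);
}

// Run two threads on the two neighbours; returns x + y, which should be 2 * iters.
inline int run_plain(int iters) {
    SharedLinePlain s;
    std::thread t1([&] { bump_plain(s.x, iters); });
    std::thread t2([&] { bump_plain(s.y, iters); });
    t1.join(); t2.join();
    return s.x + s.y;
}

inline int run_atomic(int iters) {
    SharedLineAtomic s;
    std::thread t1([&] { bump_atomic(s.x, iters); });
    std::thread t2([&] { bump_atomic(s.y, iters); });
    t1.join(); t2.join();
    return s.x.load() + s.y.load();
}
```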

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                #    3.835 CPUs utilized            ( +-  0.42% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               122      page-faults               #  152.953 /sec                     ( +-  0.22% )
     3,152,973,230      cycles                    #    3.953 GHz                      ( +-  0.42% )
     2,003,038,681      instructions              #    0.65  insn per cycle           ( +-  0.00% )
     2,868,628,070      uops_issued.any           #    3.596 G/sec                    ( +-  0.41% )
     2,934,059,729      uops_executed.thread      #    3.678 G/sec                    ( +-  0.30% )
        10,810,169      machine_clears.memory_ordering #   13.553 M/sec                    ( +-  0.90% )

           0.21123 +- 0.00124 seconds time elapsed  ( +-  0.59% )

Improved benchmark - align the array, and use volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also being stored and reloaded inside the loop.

I also used alignas(128) counter[] to make sure the start of the array occupies two 128-byte line pairs, rather than being spread across three.

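#include <thread>

// SIZE is the stride between counters in bytes; pass it with -DSIZE=... on the
// compiler command line (4, 64 and 128 in the runs above).
// alignas(128) starts the array on a 128-byte boundary; volatile keeps every
// increment as a real load and store when compiling with optimization.
alignas(128) volatile int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

// Convert the byte stride into an index stride.
static const int stride = SIZE/sizeof(counter[0]);
int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}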

Asked:	3 years, 6 months ago
Viewed:	2210 times
Last active:	3 years, 6 months ago