Copying 64 bytes of memory with NT stores to 1 full cache line vs. 2 consecutive partial cache lines

Question

St.*_*rio 7 c performance x86 assembly avx

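I was reading the Intel Optimization Manual's material on write-combining memory and wrote a benchmark to understand how it works. These are the 2 functions I am benchmarking: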

memcopy.h:

void avx_ntcopy_cache_line(void *dest, const void *src);

void avx_ntcopy_64_two_cache_lines(void *dest, const void *src);

memcopy.S:

avx_ntcopy_cache_line:
    ; System V AMD64 ABI: rdi = dest (1st argument), rsi = src (2nd argument)
    vmovdqa ymm0, [rsi]
    vmovdqa ymm1, [rsi + 0x20]
    vmovntdq [rdi], ymm0
    vmovntdq [rdi + 0x20], ymm1
    ;intentionally no sfence after nt-store
    ret

avx_ntcopy_64_two_cache_lines:
    ; same register roles: load from src (rsi), NT-store to dest (rdi)
    vmovdqa ymm0, [rsi]
    vmovdqa ymm1, [rsi + 0x40]
    vmovntdq [rdi], ymm0
    vmovntdq [rdi + 0x40], ymm1
    ;intentionally no sfence after nt-store
    ret
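
For readers who prefer C to assembly, the same two store patterns can be sketched with AVX intrinsics. This is not the code that was benchmarked: the `_c` function names are hypothetical, the `target("avx")` attribute is a GCC/Clang extension used here so that no `-mavx` flag is needed, and both pointers are assumed to be 64-byte aligned as in the benchmark.

```c
#include <immintrin.h>  /* _mm256_load_si256, _mm256_stream_si256 */

/* Hypothetical C counterpart of avx_ntcopy_cache_line:
   two 32-byte NT stores that together fill one 64-byte line. */
__attribute__((target("avx")))
void avx_ntcopy_cache_line_c(void *dest, const void *src)
{
    const char *s = src;
    char *d = dest;
    __m256i lo = _mm256_load_si256((const __m256i *)s);
    __m256i hi = _mm256_load_si256((const __m256i *)(s + 0x20));
    _mm256_stream_si256((__m256i *)d, lo);
    _mm256_stream_si256((__m256i *)(d + 0x20), hi);
    /* deliberately no _mm_sfence(), mirroring the assembly */
}

/* Hypothetical C counterpart of avx_ntcopy_64_two_cache_lines:
   one 32-byte NT store into each of two adjacent lines,
   so neither line is ever completely written. */
__attribute__((target("avx")))
void avx_ntcopy_64_two_cache_lines_c(void *dest, const void *src)
{
    const char *s = src;
    char *d = dest;
    __m256i first  = _mm256_load_si256((const __m256i *)s);
    __m256i second = _mm256_load_si256((const __m256i *)(s + 0x40));
    _mm256_stream_si256((__m256i *)d, first);
    _mm256_stream_si256((__m256i *)(d + 0x40), second);
    /* deliberately no _mm_sfence(), mirroring the assembly */
}
```

Outside a benchmark like this one, a caller would normally issue `_mm_sfence()` after the last NT store before relying on the data being visible to other cores.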

This is what the benchmark's main function looks like:

#include <stdlib.h>
#include <inttypes.h>
#include <x86intrin.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include "memcopy.h"

#define ITERATIONS 1000000

//As @HadiBrais noted, there might be an issue with 4K aliasing
_Alignas(64) char src[128];
_Alignas(64) char dest[128];

static void run_benchmark(unsigned runs, unsigned run_iterations,
                    void (*fn)(void *, const void*), void *dest, const void* src);

int main(void){
    int fd = open("/dev/urandom", O_RDONLY);
    read(fd, src, sizeof src);

    run_benchmark(20, ITERATIONS, avx_ntcopy_cache_line, dest, src);
    run_benchmark(20, ITERATIONS, avx_ntcopy_64_two_cache_lines, dest, src);
}

static int uint64_compare(const void *u1, const void *u2){
    uint64_t uint1 = *(uint64_t *) u1;
    uint64_t uint2 = *(uint64_t *) u2;
    if(uint1 < uint2){
        return -1;
    } else if (uint1 == uint2){
        return 0;
    } else {
        return 1;
    }
}

static inline uint64_t benchmark_2cache_lines_copy_function(unsigned iterations, void (*fn)(void *, const void *),
                                               void *restrict dest, const void *restrict src){
    uint64_t *results = malloc(iterations * sizeof(uint64_t));
    unsigned idx = iterations;
    while(idx --> 0){
        //bit 30 selects the fixed-function counters; counter 1 counts unhalted core cycles
        uint64_t start = __rdpmc((1<<30)+1);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        fn(dest, src);
        uint64_t finish = __rdpmc((1<<30)+1);
        //16 calls per measurement, so >> 4 gives core cycles per call
        results[idx] = (finish - start) >> 4;
    }
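    qsort(results, iterations, sizeof *results, uint64_compare);
    //median
    uint64_t median = results[iterations >> 1];
    //release the buffer; otherwise every run leaks iterations * 8 bytes
    free(results);
    return median;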
}

static void run_benchmark(unsigned runs, unsigned run_iterations,
                    void (*fn)(void *, const void*), void *dest, const void* src){
    unsigned current_run = 1;
    while(current_run <= runs){
        uint64_t time = benchmark_2cache_lines_copy_function(run_iterations, fn, dest, src);
        printf("Run %u result: %" PRIu64 "\n", current_run, time);
        current_run++;
    }
}

Compiling with the options

-Werror \
-Wextra \
-Wall \
-pedantic \
-Wno-stack-protector \
-g3 \
-O3 \
-Wno-unused-result \
-Wno-unused-parameter

and running the benchmark, I got the following results:

I. avx_ntcopy_cache_line:

Run 1 result: 61
Run 2 result: 61
Run 3 result: 61
Run 4 result: 61
Run 5 result: 61
Run 6 result: 61
Run 7 result: 61
Run 8 result: 61
Run 9 result: 61
Run 10 result: 61
Run 11 result: 61
Run 12 result: 61
Run 13 result: 61
Run 14 result: 61
Run 15 result: 61
Run 16 result: 61
Run 17 result: 61
Run 18 result: 61
Run 19 result: 61
Run 20 result: 61

perf:

 Performance counter stats for './bin':

     3 503 775 289      L1-dcache-loads                                               (18,87%)
        91 965 805      L1-dcache-load-misses     #    2,62% of all L1-dcache hits    (18,94%)
     2 041 496 256      L1-dcache-stores                                              (19,01%)
         5 461 440      LLC-loads                                                     (19,08%)
         1 108 179      LLC-load-misses           #   20,29% of all LL-cache hits     (19,10%)
        18 028 817      LLC-stores                                                    (9,55%)
       116 865 915      l2_rqsts.all_pf                                               (14,32%)
                 0      sw_prefetch_access.t1_t2                                      (19,10%)
           666 096      l2_lines_out.useless_hwpf                                     (19,10%)
        47 701 696      l2_rqsts.pf_hit                                               (19,10%)
        62 556 656      l2_rqsts.pf_miss                                              (19,10%)
         4 568 231      load_hit_pre.sw_pf                                            (19,10%)
        17 113 190      l2_rqsts.rfo_hit                                              (19,10%)
        15 248 685      l2_rqsts.rfo_miss                                             (19,10%)
        54 460 370      LD_BLOCKS_PARTIAL.ADDRESS_ALIAS                               (19,10%)
    18 469 040 693      uops_retired.stall_cycles                                     (19,10%)
    16 796 868 661      uops_executed.stall_cycles                                    (19,10%)
    18 315 632 129      uops_issued.stall_cycles                                      (19,05%)
    16 176 115 539      resource_stalls.sb                                            (18,98%)
    16 424 440 816      resource_stalls.any                                           (18,92%)
    22 692 338 882      cycles                                                        (18,85%)

       5,780512545 seconds time elapsed

       5,740239000 seconds user
       0,040001000 seconds sys

II. avx_ntcopy_64_two_cache_lines:

Run 1 result: 6
Run 2 result: 6
Run 3 result: 6
Run 4 result: 6
Run 5 result: 6
Run 6 result: 6
Run 7 result: 6
Run 8 result: 6
Run 9 result: 6
Run 10 result: 6
Run 11 result: 6
Run 12 result: 6
Run 13 result: 6
Run 14 result: 6
Run 15 result: 6
Run 16 result: 6
Run 17 result: 6
Run 18 result: 6
Run 19 result: 6
Run 20 result: 6

perf:

 Performance counter stats for './bin':

     3 095 792 486      L1-dcache-loads                                               (19,26%)
        82 194 718      L1-dcache-load-misses     #    2,66% of all L1-dcache hits    (18,99%)
     1 793 291 250      L1-dcache-stores                                              (19,00%)
         4 612 503      LLC-loads                                                     (19,01%)
           975 438      LLC-load-misses           #   21,15% of all LL-cache hits     (18,94%)
        15 707 916      LLC-stores                                                    (9,47%)
        97 928 734      l2_rqsts.all_pf                                               (14,20%)
                 0      sw_prefetch_access.t1_t2                                      (19,21%)
           532 203      l2_lines_out.useless_hwpf                                     (19,19%)
        35 394 752      l2_rqsts.pf_hit                                               (19,20%)
        56 303 030      l2_rqsts.pf_miss                                              (19,20%)
         6 197 253      load_hit_pre.sw_pf                                            (18,93%)
        13 458 517      l2_rqsts.rfo_hit                                              (18,94%)
        14 031 767      l2_rqsts.rfo_miss                                             (18,93%)
        36 406 273      LD_BLOCKS_PARTIAL.ADDRESS_ALIAS                               (18,94%)
     2 213 339 719      uops_retired.stall_cycles                                     (18,93%)
     1 225 185 268      uops_executed.stall_cycles                                    (18,94%)
     1 943 649 682      uops_issued.stall_cycles                                      (18,94%)
       126 401 004      resource_stalls.sb                                            (19,20%)
       202 537 285      resource_stalls.any                                           (19,20%)
     5 676 443 982      cycles                                                        (19,18%)

       1,521271014 seconds time elapsed

       1,483660000 seconds user
       0,032253000 seconds sys

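As can be seen, there is a 10x difference in the measurement results.

My interpretation:

As explained in the Intel Optimization Manual/3.6.9:

writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus (since they are not cached) as several partial writes

I assumed that in the case of avx_ntcopy_cache_line we have a full 64-byte write that initiates the bus transaction to write the data out, which prevents rdtsc from being executed out of order.

By contrast, in the case of avx_ntcopy_64_two_cache_lines we have 32 bytes written into different cache lines that go to the WC buffer, and the bus transaction is not triggered. This allows rdtsc to be executed out of order.

This interpretation looks extremely suspicious, and it does not agree with the difference in bus-cycles: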

avx_ntcopy_cache_line: 131 454 700

avx_ntcopy_64_two_cache_lines: 31 957 050

QUESTION: What is the true cause of such a difference in measurement?

Answer 1

Pet*_*des 4

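Hypothesis: a (fully) overlapping store to a not-yet-flushed WC buffer can simply merge into it. Completing a line triggers an immediate flush, and having all those stores go all the way off core is slow.

You report about 100x more resource_stalls.sb for the full-line version than for the 2-partial-line version. That is consistent with this explanation.

If 2_lines can commit the NT stores into an existing WC buffer (LFB), the store buffer can keep up with the rate at which store instructions execute, so the code usually bottlenecks on something else. (Probably just the front end, given the call/ret overhead for each pair of loads/stores. And of course call itself includes a store.) Your perf results show 1.8 billion stores (to L1) over 5.7 billion cycles, so well within the 1 store/cycle limit we might expect for stores hitting in the WC buffer.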
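
The ratios quoted above can be reproduced from the posted counters with a few lines of C. This is only a sketch: the struct and function names are hypothetical, and the inputs are the scaled estimates perf printed (each event was multiplexed to roughly 19% coverage), so the results are approximate.

```c
#include <stdio.h>

/* Counter values copied from the two perf outputs in the question.
   They are multiplexed, scaled estimates rather than exact counts. */
struct perf_sample {
    const char *name;
    double l1_stores;   /* L1-dcache-stores */
    double sb_stalls;   /* resource_stalls.sb */
    double cycles;      /* cycles */
};

static const struct perf_sample full_line = {
    "avx_ntcopy_cache_line", 2041496256.0, 16176115539.0, 22692338882.0
};
static const struct perf_sample two_lines = {
    "avx_ntcopy_64_two_cache_lines", 1793291250.0, 126401004.0, 5676443982.0
};

/* Average number of stores per core cycle over the whole run. */
static double stores_per_cycle(const struct perf_sample *s)
{
    return s->l1_stores / s->cycles;
}

/* Fraction of cycles in which allocation stalled on a full store buffer. */
static double sb_stall_fraction(const struct perf_sample *s)
{
    return s->sb_stalls / s->cycles;
}

static void print_sample(const struct perf_sample *s)
{
    printf("%-30s %.3f stores/cycle, %.1f%% of cycles stalled on the store buffer\n",
           s->name, stores_per_cycle(s), 100.0 * sb_stall_fraction(s));
}
```

Calling `print_sample(&full_line)` and `print_sample(&two_lines)` gives roughly 0.09 stores/cycle with about 71% of cycles stalled for the full-line version, against roughly 0.32 stores/cycle with about 2% stalled for the two-line version. The resource_stalls.sb counts themselves differ by a factor of about 128.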

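But if a WC buffer gets flushed, which happens when a line is fully written, the data has to go off core (which is slow), tying up that LFB for a while so that it cannot be used to commit later NT stores. When stores cannot leave the store buffer, it fills up, and the core stalls because it cannot allocate resources for new store instructions to enter the back end. (Specifically, the issue/rename/allocate stage stalls.)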
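
The hypothesis in the preceding paragraphs can be written down as a toy model. To be clear about what is assumed: this simulates the hypothesis, not real hardware. The type and function names are hypothetical, the number of buffers (`NUM_WC`) is arbitrary, and the only rules modeled are "a store to an open buffer merges" and "a completely written line is flushed at once".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u
#define NUM_WC 4u   /* arbitrary; real buffer counts are microarchitecture-specific */

struct wc_buffer {
    bool     valid;
    uint64_t line;   /* address / 64 */
    uint64_t mask;   /* bit i set => byte i of the line has been written */
};

struct wc_model {
    struct wc_buffer buf[NUM_WC];
    unsigned next_victim;
    unsigned long merges;           /* stores absorbed by an open buffer */
    unsigned long full_flushes;     /* completed lines sent off core */
    unsigned long partial_flushes;  /* evictions of incomplete lines */
};

/* Model one NT store of len bytes; it must not cross a line boundary. */
static void wc_nt_store(struct wc_model *m, uint64_t addr, unsigned len)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned off = (unsigned)(addr % LINE_BYTES);
    uint64_t bytes = (len >= LINE_BYTES) ? ~UINT64_C(0)
                                         : ((UINT64_C(1) << len) - 1) << off;
    struct wc_buffer *b = NULL;

    for (unsigned i = 0; i < NUM_WC; i++)
        if (m->buf[i].valid && m->buf[i].line == line)
            b = &m->buf[i];

    if (b) {
        m->merges++;                 /* line already open: merge into it */
    } else {
        for (unsigned i = 0; i < NUM_WC && !b; i++)
            if (!m->buf[i].valid)
                b = &m->buf[i];
        if (!b) {                    /* all buffers busy: evict one */
            b = &m->buf[m->next_victim];
            m->next_victim = (m->next_victim + 1) % NUM_WC;
            m->partial_flushes++;
        }
        b->valid = true;
        b->line = line;
        b->mask = 0;
    }

    b->mask |= bytes;
    if (b->mask == ~UINT64_C(0)) {   /* line complete: immediate flush */
        m->full_flushes++;
        b->valid = false;
    }
}

/* Replay `calls` invocations: a 32-byte store at offset 0 followed by a
   32-byte store at second_offset (0x20 or 0x40 in the question). */
static void wc_replay(struct wc_model *m, unsigned calls, uint64_t second_offset)
{
    for (unsigned i = 0; i < calls; i++) {
        wc_nt_store(m, 0, 32);
        wc_nt_store(m, second_offset, 32);
    }
}
```

Replaying 16 calls on a zero-initialized `struct wc_model` with `second_offset = 0x20` ends with 16 full flushes, one per call. With `second_offset = 0x40` it ends with no flushes at all, because after the first call every store merges into one of two buffers that stay open.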

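You could probably see this effect more clearly with any of the L2, L3, SQ, or offcore req/resp events, which would pick up all of this traffic outside of L1. You included some L2 counters, but those probably do not pick up NT stores that pass through L2.

"Enhanced REP MOVSB for memcpy" suggests that with NT stores it takes longer for the LFB to "hand off" to the outer levels of the memory hierarchy, keeping the LFB occupied long after the request starts its journey. (Perhaps to make sure a core can always reload what it just stored, or otherwise to avoid losing track of an in-flight NT store so that coherency with MESI is maintained.) A later sfence also needs to know when earlier NT stores have become visible to other cores, so we cannot have them invisible at any point before that.

Even if that is not the case, there is still going to be a throughput bottleneck somewhere for all those NT store requests. So the other possible mechanism is that they fill up some buffer, and then the core cannot hand off LFBs any more, so it runs out of LFBs to commit NT stores into, and then the SB fills up and stalls allocation.

They might merge once they get to the memory controller, without each one needing a burst transfer over the actual external memory bus, but the path from a core through the uncore to a memory controller is not short.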

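Even doing 2x rdpmc for every 32 stores does not slow the CPU down enough to prevent the store buffer from filling up. What you are seeing depends on running this in a relatively tight loop, not on a one-shot execution that starts with an empty store buffer. Also, your suggestion that rdpmc or rdtsc would not be reordered with respect to the WC buffers flushing makes no sense. Execution of stores is not ordered with respect to execution of rdtsc.

TL:DR: using rdpmc to time an individual group of stores is not helpful, and if anything it hides some of the performance difference by slowing down the fast case, which does not bottleneck on the store buffer.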
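
One way to act on this is to time the whole loop as a single block instead of bracketing each group of 16 calls. The sketch below is an assumption about how that could look, not the answerer's code: `ns_per_call`, `copy_fn`, and `plain_copy_64` are hypothetical names, `plain_copy_64` is a portable stand-in for the assembly routines, and `clock()` is used only because it is standard C. Reading a cycle counter once before and once after the loop, or using the `cycles` total from perf stat, would be the closer analogue of the original measurement.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dest, const void *src);

/* Hypothetical stand-in so the sketch runs anywhere; in the benchmark this
   would be avx_ntcopy_cache_line or avx_ntcopy_64_two_cache_lines. */
static void plain_copy_64(void *dest, const void *src)
{
    memcpy(dest, src, 64);
}

/* Time `calls` back-to-back calls as one block and return the average
   CPU time per call in nanoseconds. `calls` must be nonzero. */
static double ns_per_call(copy_fn fn, void *dest, const void *src, uint64_t calls)
{
    clock_t start = clock();
    for (uint64_t i = 0; i < calls; i++)
        fn(dest, src);
    clock_t finish = clock();
    return (double)(finish - start) * 1e9 / CLOCKS_PER_SEC / (double)calls;
}
```

A call such as `ns_per_call(plain_copy_64, dest, src, 16000000)` covers the same number of copies as one run of the original benchmark. Note that an optimizing compiler may collapse the `memcpy` stand-in, whereas the separately assembled routines are opaque calls that it cannot remove.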

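The one thing I am least sure about is whether writing the same line completely with NT stores, and then doing it again back to back, really results in two flushes all the way to memory. It seems plausible that the second request might catch up with the first one, at least some of the time, e.g. in an outer cache level. Maybe that does happen here, but of course performance would still be bad unless it happened a lot. It would be interesting to look at the offcore request events to see whether there is a 1:1 correspondence with the writes in the benchmark. (2 upvotes)
So it looks like there is a good chance of the requests merging only once they reach the MC queue (as far as I know, the MC definitely does this kind of merging). (2 upvotes)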

Archived:	6 years ago
Views:	221
Last activity:	6 years ago