_mm_clflush 真的会刷新缓存吗?

xia*_*ogw 5 linux x86-64 cpu-architecture cpu-cache cache-locality

我试图通过编写和运行测试程序来了解硬件缓存的工作原理:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64

#define L1_WAYS     8
#define L1_SETS     64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    int junk = 0;
    register uint64_t t1, t2;

    printf("data: %p\n", data);

    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}
Run Code Online (Sandbox Code Playgroud)

我使用 和 不使用 运行代码_mm_clflush,而结果只显示_mm_clflush第一次内存访问速度更快。

_mm_clflush

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 280
i =  1, cycles: 84
i =  2, cycles: 91
i =  3, cycles: 77
i =  4, cycles: 91
Run Code Online (Sandbox Code Playgroud)

_mm_clflush

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 3899
i =  1, cycles: 91
i =  2, cycles: 105
i =  3, cycles: 77
i =  4, cycles: 84
Run Code Online (Sandbox Code Playgroud)

刷新缓存行只是没有意义,但实际上变得更快了?谁能解释为什么会发生这种情况?谢谢

----------------进一步实验-------------------

让我们假设 3899 个周期是由 TLB 未命中引起的。为了证明我对缓存命中/未命中的了解,我稍微修改了此代码以比较L1 cache hit和情况下的内存访问时间L1 cache miss

这一次,代码跳过缓存行大小(64 字节)并访问下一个内存地址。

*data = 1;
_mm_clflush(data);
printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

// Invalidate and flush the cache line that contains p from all levels of the cache hierarchy.
_mm_clflush(data);
printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i*LINE_SIZE];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}
Run Code Online (Sandbox Code Playgroud)

由于我的电脑有一个8路组关联L1数据缓存,有64组,总共32KB。如果我每 64 字节访问一次内存,它应该会导致所有缓存未命中。但是好像有很多缓存行已经缓存了:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 273
i =  1, cycles: 70
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 70
i =  5, cycles: 70
i =  6, cycles: 70
i =  7, cycles: 70
i =  8, cycles: 70
i =  9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i =  0, cycles: 301
i =  1, cycles: 133
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 147
i =  5, cycles: 56
i =  6, cycles: 70
i =  7, cycles: 63
i =  8, cycles: 70
i =  9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
Run Code Online (Sandbox Code Playgroud)

这是由预取引起的吗?还是我的理解有问题?谢谢

Luo*_*ing 2

我通过在之前添加写入来修改代码_mm_clflush(data),它显示 clflush 确实刷新了缓存行。修改后的代码:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    data[0] = 1; //write before cflush
    //_mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}
Run Code Online (Sandbox Code Playgroud)

我在我的计算机(Intel(R) Core(TM) i5-8500 CPU)上运行修改后的代码并得到以下结果。根据多次尝试,首次访问之前刷新到内存的数据的延迟明显高于未刷新到内存的数据。

没有 cllush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 64
i =  1, cycles: 46
i =  2, cycles: 49
i =  3, cycles: 48
i =  4, cycles: 46
Run Code Online (Sandbox Code Playgroud)

与 clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 214
i =  1, cycles: 41
i =  2, cycles: 40
i =  3, cycles: 42
i =  4, cycles: 40
Run Code Online (Sandbox Code Playgroud)