Cache memory optimization for array transpose: C

Question


typedef int array[2][2];

void transpose(array dst, array src) {
    int i, j;
    for (j = 0; j < 2; j++) {
        for (i = 0; i < 2; i++) {
            dst[i][j] = src[j][i];
        }
    }
}


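The src array starts at address 0 and the dst array starts at address 0x10.

The L1 data cache is direct-mapped and write-allocate, with an 8-byte block size.

The cache has a total size of 16 data bytes.

For each entry of the src and dst arrays, is the access a hit or a miss?

The answer is: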

src: 
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> miss


If the cache has a total size of 32 data bytes, the answer is:

src: 
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit


I am not sure how either of these results is obtained. I do not understand arrays and caches very well.

Answer 1

Spe*_*ump 1

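So, in the first case you have two cache lines of 8 bytes each, 16 bytes in total. I am assuming an int is 4 bytes. Given how C lays out arrays in memory and the offsets you gave, these are the memory lines that can be cached: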

Cacheable lines:
#A: &src[0][0] = 0x00, &src[0][1] = 0x04
#B: &src[1][0] = 0x08, &src[1][1] = 0x0C
#C: &dst[0][0] = 0x10, &dst[0][1] = 0x14
#D: &dst[1][0] = 0x18, &dst[1][1] = 0x1C
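
In a direct-mapped cache each of these lines can live in exactly one place, which is determined by its address. Here is a minimal sketch of that mapping, assuming the usual rule set = (address / block size) mod (number of lines). The name set_index is a hypothetical helper, not something from the question.

```c
#include <assert.h>

/* Hypothetical helper: which set of a direct-mapped cache a byte address
   maps to. block_size and cache_size are both given in bytes. */
static unsigned set_index(unsigned addr, unsigned block_size, unsigned cache_size) {
    assert(block_size != 0 && cache_size >= block_size);
    unsigned num_lines = cache_size / block_size;  /* one line per set */
    return (addr / block_size) % num_lines;
}
```

With the 16-byte cache, set_index(0x00, 8, 16) and set_index(0x10, 8, 16) are both 0, so lines A and C compete for the same set, and lines B and D (0x08 and 0x18) both map to set 1. With the 32-byte cache the four lines map to four different sets.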


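Next we need the order in which the program accesses each memory address. I am assuming there are no optimizations that would let the compiler reorder the accesses.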

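Access order and cache behavior (assuming initially empty; direct-mapped, so set = (address / 8) mod 2):
#1: load src[0][0] --> Miss line A = set 0
#2: save dst[0][0] --> Miss line C = set 0 (replaces line A)
#3: load src[0][1] --> Miss line A = set 0 (replaces line C)
#4: save dst[1][0] --> Miss line D = set 1
#5: load src[1][0] --> Miss line B = set 1 (replaces line D)
#6: save dst[0][1] --> Miss line C = set 0 (replaces line A)
#7: load src[1][1] --> Hit  line B = set 1
#8: save dst[1][1] --> Miss line D = set 1 (replaces line B)

Note that the loop reads src row by row but writes dst column by column, so the second store goes to dst[1][0], not dst[0][1]. Because the cache is direct-mapped, there is no replacement choice such as LRU: lines A and C always compete for set 0, and lines B and D for set 1. This matches your first answer. Then, with everything else unchanged and a cache size of 32 bytes (4 lines):

Access order and cache behavior (assuming initially empty; set = (address / 8) mod 4):
#1: load src[0][0] --> Miss line A = set 0
#2: save dst[0][0] --> Miss line C = set 2
#3: load src[0][1] --> Hit  line A = set 0
#4: save dst[1][0] --> Miss line D = set 3
#5: load src[1][0] --> Miss line B = set 1
#6: save dst[0][1] --> Hit  line C = set 2
#7: load src[1][1] --> Hit  line B = set 1
#8: save dst[1][1] --> Hit  line D = set 3

This matches your second answer. Every line has its own set, so the only misses are the four cold misses, one per line.

The two cases also differ if you run transpose a second time. In case 1 you get nearly the same behavior: the cache starts out holding lines C and D, so only the store to dst[1][0] turns into a hit, and the two arrays keep evicting each other everywhere else. With the larger cache, everything the second call needs is already cached, so there are no misses at all.

All of this assumes that the loop counters (i and j) are kept in registers, so they do not touch the data cache. If they lived somewhere in memory, their accesses would interact with the cache as well and could change these results.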
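
The hit/miss bookkeeping for this problem can also be checked mechanically. Below is a minimal sketch in C of a direct-mapped, write-allocate cache fed with the accesses the transpose loop performs (load src[j][i], then store dst[i][j]). It assumes a 4-byte int, a cold cache, and loop counters held in registers. The names Cache, access_cache, and simulate are hypothetical helpers, not part of the original question.

```c
#include <assert.h>
#include <stdbool.h>

/* Parameters taken from the question; a 4-byte int is an assumption. */
enum { BLOCK_SIZE = 8, INT_SIZE = 4, SRC_BASE = 0x00, DST_BASE = 0x10, MAX_LINES = 4 };

/* Direct-mapped cache: each set remembers which memory block it holds. */
typedef struct {
    unsigned num_lines;
    bool valid[MAX_LINES];
    unsigned block[MAX_LINES];
} Cache;

/* Returns true on a hit. On a miss the block is brought in; with
   write-allocate, stores are placed exactly like loads. */
static bool access_cache(Cache *c, unsigned addr) {
    unsigned blk = addr / BLOCK_SIZE;
    unsigned set = blk % c->num_lines;
    if (c->valid[set] && c->block[set] == blk) {
        return true;
    }
    c->valid[set] = true;
    c->block[set] = blk;
    return false;
}

/* Replays the transpose's memory accesses on a cold cache of cache_size
   bytes and records, per array element, whether its access was a hit. */
static void simulate(unsigned cache_size, bool src_hit[2][2], bool dst_hit[2][2]) {
    Cache c = { .num_lines = cache_size / BLOCK_SIZE };  /* other fields start at zero */
    assert(c.num_lines >= 1 && c.num_lines <= MAX_LINES);
    for (int j = 0; j < 2; j++) {
        for (int i = 0; i < 2; i++) {
            /* dst[i][j] = src[j][i]: the load comes before the store. */
            src_hit[j][i] = access_cache(&c, SRC_BASE + INT_SIZE * (2u * j + i));
            dst_hit[i][j] = access_cache(&c, DST_BASE + INT_SIZE * (2u * i + j));
        }
    }
}
```

Under these assumptions, simulate(16, ...) gives a hit only for src[1][1], and simulate(32, ...) gives hits for src[0][1], src[1][1], dst[0][1], and dst[1][1], which are the two per-element patterns quoted in the question.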
