Cache utilization in matrix transpose in C

Dan*_*Dan 5 c caching computer-architecture

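This code transposes a matrix in four ways. The first writes sequentially and reads non-sequentially. The second does the opposite. The next two are the same as the first two, except that the writes bypass the cache. What seems to happen is that sequential writes are faster, and bypassing the cache is faster. What I don't understand is this: if the cache is bypassed, why are sequential writes still faster?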

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%lld\n", after.QuadPart - before.QuadPart);

Edit: the output is

Transpose 1:    47603
Transpose 2:    92449
Transpose 3:    38340
Transpose 4:    69597

Kam*_*uri 4

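The CPU has a write-combining buffer that merges writes to the same cache line so that they go out to memory in one go. In this case (sequential writes that bypass the cache), the write-combining buffer acts as a one-line cache, which makes the result very similar to the case where the cache is not bypassed.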

To be precise, even when the cache is bypassed, the writes still reach memory in bursts.

See the write-combining logic behavior here.