标签: cache-locality

_mm_clflush 真的会刷新缓存吗？

我试图通过编写和运行测试程序来了解硬件缓存的工作原理：

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64

#define L1_WAYS     8
#define L1_SETS     64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    int junk = 0;
    register uint64_t t1, t2;

    printf("data: %p\n", data);

    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1; …

Run Code Online (Sandbox Code Playgroud)

linux x86-64 cpu-architecture cpu-cache cache-locality

xia*_*ogw

2018 09-27

5
推荐指数

1
解决办法

2452
查看次数

数组迭代中的CPU空间缓存局部性

我对L1缓存的理解是内存提取加载了缓存行.假设高速缓存行大小为64个字节,如果我在地址访问存储器p,它将从加载整个块p到p + 64到缓存中.因此,最好从左到右(而不是从右到左)迭代一个数组,以最大化缓存局部性.

但是,我编写了一个示例C代码,它分配了一个包含1亿个字符的数组,将随机值写入其中并对其求和(下面复制以供参考).一个版本的代码从左到右,另一个从右到左.当我对它进行基准测试时,我得到了非常类似的结果(其中"时钟周期"是根据测量的clock.代码是在没有优化的情况下编译的.

所以我的问题是:现代处理器做的不仅仅是"缓存读取+ 64字节"吗？他们是向前和向后缓存吗？编译器可以"告诉"处理器代码向后迭代吗？

作为参考,我正在Mac OS X 10.13.3使用gcc-7 (Homebrew GCC 7.2.0_1) 7.2.0具有64字节缓存行的x86-64 Intel处理器.

Benchmakrs:

$ ./a.out
Backward Iterating...took 150101 clock cycles

$ ./a.out
Forward Iterating...took 146545 clock cycles

Run Code Online (Sandbox Code Playgroud)

我希望前向迭代速度大约快64倍,因为每64个元素应该是缓存命中,而对于反向迭代,每个元素应该是缓存未命中.

所以,我打电话给cachegrind.两者的缓存命中率几乎相同:

# Left to right iteration
==21773==
==21773== I   refs:      4,006,996,067
==21773== I1  misses:            5,183
==21773== LLi misses:            3,019
==21773== I1  miss rate:          0.00%
==21773== LLi miss rate:          0.00%
==21773==
==21773== D   refs:      1,802,393,260  (1,401,627,925 rd   + …

Run Code Online (Sandbox Code Playgroud)

c x86 cpu-cache cache-locality gcc7

use*_*793

lucky-day

3
推荐指数

1
解决办法

240
查看次数

为什么将排序的键插入 std::set 比插入混洗的键快得多？

我意外地发现插入排序的键std::set比插入混洗的键要快得多。这有点违反直觉，因为一棵红黑树（我证实std::set作为自平衡二叉搜索树在我的系统上是作为红黑树实现的）需要做很多重新平衡操作来插入排序的序列键，因此插入排序的键应该比插入混洗的键花费更多的时间。

但事实是，插入排序的键比插入混洗的键快 15 倍！这是我的测试代码和一些结果：

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <set>
#include <vector>
using namespace std;

int64_t insertion_time(const vector<int> &keys) {    
        auto start = chrono::system_clock::now();
        set<int>(keys.begin(), keys.end());
        auto stop = chrono::system_clock::now();
        auto elapsed = chrono::duration_cast<chrono::milliseconds>(stop - start);
        return elapsed.count(); 
}

int main() {
    size_t test_size;
    cout << "test size: ";
    cin >> test_size;
    vector<int> keys(test_size);
    for (int i = 0; i < test_size; ++i) {
        keys[i] = i;
    }
    
    // whether shuffled case …

Run Code Online (Sandbox Code Playgroud)

c++ stl red-black-tree stdset cache-locality

Leo*_*ruz

lucky-day

3
推荐指数

1
解决办法

65
查看次数