Erroneous single-threaded memory bandwidth benchmark

Question

Nit*_*lly 5 c++ benchmarking assembly performance-testing memory-bandwidth

To measure the bandwidth of main memory, I came up with the following approach.

Code (for the Intel compiler)

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937


int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
    }
    
    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}


Notes on the code

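Admittedly a "naive" approach, and Linux-only. It should still serve as a rough indicator of model performance.
Compiled with ICC using the flags -O3 -ffast-math -march=coffeelake.
The size (150 MiB) is much larger than the lowest-level cache of the system (9 MiB on an i5-8400 Coffee Lake), which has 2x 16 GiB DIMMs of DDR4 at 3200 MT/s.
A new allocation on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits).
The minimum duration is recorded to counteract the effects of interrupts and OS scheduling: threads being temporarily taken off their cores, etc.
A warm-up run is done to counteract the effect of dynamic frequency scaling (a kernel feature, which can alternatively be turned off by using the userspace governor).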

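Results

On my machine I get 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question: Intel Advisor's bandwidth information, where a previous version of this code was getting page faults inside the timed region.)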

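Assembly: here is a link to the assembly generated for the code above: https://godbolt.org/z/Ma7PY49bE

I cannot understand how I am getting such an unreasonably high bandwidth result. Any hints that would help my understanding would be much appreciated.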

Answer 1

Nit*_*lly 1

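The question really seems to be "why is the obtained bandwidth so high?", on which I have received a lot of input from @PeterCordes and @Sebastian. That information needs to be digested in its own time.

I can still offer an auxiliary "answer" on the topic of interest. By replacing the plain write (which, as I now understand, cannot be modelled properly in a benchmark without delving into the assembly) with a cheap operation, an XOR in the updated code, we can prevent the compiler from doing its job a little too well.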

Updated code

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign


int main()
{
    // test-parameters
    const auto size = std::size_t{100 * 1024 * 1024};
    const auto experiment_count = std::size_t{250};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // allocate for exp. data and load the memory pages
    char* data = nullptr;
    posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size);
    if (data == nullptr)
    {
        std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
        std::abort();
    }
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 0;
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // run
        const auto dur1 = omp_get_wtime() * 1E+6;
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] ^= 1;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
    }
    
    // deallocate resources
    free(data);
    
    // REPORT
    const auto traffic = size * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}
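
This benchmark is still a "naive" one and serves only as an indicator of model performance (as opposed to a program that can calculate the memory bandwidth exactly).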

With the updated code I get 24 GiB/s single-threaded and 37 GiB/s when all 6 cores are involved. Compared with Intel Advisor's measurements of 25.5 GiB/s and 37.5 GiB/s, I consider this acceptable.

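@PeterCordes I kept the warm-up loop so that the whole procedure is run once beforehand in exactly the same way, to counteract unknown effects (a healthy programmer's paranoia).

Edit: in this case the warm-up loop is indeed redundant, because it is the minimum duration that is being timed.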
