Vector sum with threads and no speedup

Fab*_*ian 16 c++ multithreading gcc blas c++11

I have a C++ program that basically performs some matrix computations. For these I use LAPACK/BLAS and usually link against MKL or ACML, depending on the platform. A lot of these matrix computations operate on different, independent matrices, so I use std::thread to let these operations run in parallel. However, I noticed that there is no speedup when using more threads. I traced the problem down to the daxpy BLAS routine: it seems that if two threads use this routine in parallel, each thread takes twice the time, even though the two threads operate on different arrays.
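For reference, daxpy computes y := a*x + y over two double arrays; through the CBLAS interface, my calls look roughly like this (a sketch, not my real code):

#include <cblas.h>  // MKL ships this interface; ACML's C interface is analogous

// y := alpha*x + y over n doubles; every thread calls this on its
// own, disjoint pair of arrays, so there is no shared data
void add_scaled(int n, double alpha, const double* x, double* y){
    cblas_daxpy(n, alpha, x, 1, y, 1);  // unit stride through both arrays
}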

The next thing I tried was writing a new, simple method to perform vector additions as a replacement for the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiled with gcc, it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the time each thread needs, so there is no speedup. However, with the Intel C++ Compiler this problem disappears: the time a single thread needs stays constant as the number of threads increases.

But I also need to compile on systems where no Intel compiler is available. So my questions are: why is there no speedup with gcc, and is there any way to improve gcc's performance?

I wrote a small program to demonstrate the effect:

// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread

#include <iostream>
#include <thread>
#include <vector>

#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {

    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };

        std::vector <std::size_t> times(num_threads, 0);    

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            // initialize the full buffers: dim = N*N elements, not just N
            for (std::size_t i {0}; i < dim; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }

            boost::posix_time::ptime now1 = 
                boost::posix_time::microsec_clock::universal_time();    

            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 = 
                boost::posix_time::microsec_clock::universal_time(); 
            boost::posix_time::time_duration dur = now2 - now1; 
            times[tid] += dur.total_milliseconds(); 
            delete[] pA;
            delete[] pB;
        };

        std::vector <std::thread> mythreads;

        // start threads
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads 
                      << "  took " << times[n]<< "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){

    // post-increment so that element 0 is included and the loop never
    // writes past the end of the arrays
    for(std::size_t i{0}; i < dim; ++i)
    { *(a++) += *(b++); }
}

Output with gcc:

Thread 1 of 1  took 532msec
Thread 1 of 2  took 1104msec
Thread 2 of 2  took 1103msec
Thread 1 of 3  took 1680msec
Thread 2 of 3  took 1821msec
Thread 3 of 3  took 1808msec
Thread 1 of 4  took 2542msec
Thread 2 of 4  took 2536msec
Thread 3 of 4  took 2509msec
Thread 4 of 4  took 2515msec

Output with icc:

Thread 1 of 1  took 663msec
Thread 1 of 2  took 674msec
Thread 2 of 2  took 674msec
Thread 1 of 3  took 681msec
Thread 2 of 3  took 681msec
Thread 3 of 3  took 681msec
Thread 1 of 4  took 688msec
Thread 2 of 4  took 689msec
Thread 3 of 4  took 687msec
Thread 4 of 4  took 688msec

So with icc, the time one thread needs for the computations is constant (as I would have expected; my CPU has 4 physical cores), while with gcc the time per thread increases. Replacing the simplesum routine with BLAS::daxpy yields the same results for icc and gcc (no surprise, since most of the time is spent in the library), and those are almost the same as the gcc results stated above.

gha*_*.st 18

The answer is simple: your threads are fighting for memory bandwidth!

Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple CPUs actually have to share the memory controller among several cores.
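A rough back-of-envelope (my numbers, assuming both arrays have to stream from DRAM, since at roughly 6.7 MB each they quickly outgrow the caches once several threads run): one pass over the data touches 936 * 936 elements * 3 accesses * 8 bytes ≈ 21 MB, so the 1000 passes move about 21 GB. Finishing that in ~530 ms corresponds to roughly 40 GB/s, which is already on the order of what a single memory controller delivers; a second thread doing the same work has nowhere left to grow.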

The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits your problem perfectly:

Thread 1 of 1  took 657msec
Thread 1 of 2  took 1447msec
Thread 2 of 2  took 1463msec
[...]
Thread 1 of 8  took 5516msec
Thread 2 of 8  took 5587msec
Thread 3 of 8  took 5205msec
Thread 4 of 8  took 5311msec
Thread 5 of 8  took 2731msec
Thread 6 of 8  took 5545msec
Thread 7 of 8  took 5551msec
Thread 8 of 8  took 4903msec

However, simply by increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation. A sketch of that change (shown here under its own name; the only new ingredients are std::exp and <cmath>):
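#include <cmath>
#include <cstddef>

// same loop as before, but every element now also costs an exp(),
// so the loop is compute-bound instead of memory-bound
void simplesum_exp(double* a, double* b, std::size_t dim){
    for (std::size_t i{0}; i < dim; ++i)
    { *(a++) += std::exp(*(b++)); }
}

The results show almost perfect scaling: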

Thread 1 of 1  took 7671msec
Thread 1 of 2  took 7759msec
Thread 2 of 2  took 7759msec
[...]
Thread 1 of 8  took 9997msec
Thread 2 of 8  took 8135msec
Thread 3 of 8  took 10625msec
Thread 4 of 8  took 8169msec
Thread 5 of 8  took 10054msec
Thread 6 of 8  took 8242msec
Thread 7 of 8  took 9876msec
Thread 8 of 8  took 8819msec

But what about ICC?

First, ICC inlines simplesum. Proving that the inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1  took 687msec
Thread 1 of 2  took 688msec
Thread 2 of 2  took 689msec
[...]
Thread 1 of 8  took 690msec
Thread 2 of 8  took 697msec
Thread 3 of 8  took 700msec
Thread 4 of 8  took 874msec
Thread 5 of 8  took 878msec
Thread 6 of 8  took 874msec
Thread 7 of 8  took 742msec
Thread 8 of 8  took 868msec
to

Thread 1 of 1  took 1278msec
Thread 1 of 2  took 2457msec
Thread 2 of 2  took 2445msec
[...]
Thread 1 of 8  took 8868msec
Thread 2 of 8  took 8434msec
Thread 3 of 8  took 7964msec
Thread 4 of 8  took 7951msec
Thread 5 of 8  took 8872msec
Thread 6 of 8  took 8286msec
Thread 7 of 8  took 5714msec
Thread 8 of 8  took 8241msec

This already explains why the library performs badly: ICC cannot inline it, and therefore, whatever else it is that makes ICC perform better than g++ here simply cannot happen.
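For reference, a minimal sketch of the translation-unit split used above (file layout mine):

// simplesum.cpp -- the definition lives in its own translation unit,
// so the call site in the main file only sees a declaration and the
// compiler cannot inline the body there
#include <cstddef>

void simplesum(double* a, double* b, std::size_t dim){
    for (std::size_t i{0}; i < dim; ++i)
    { *(a++) += *(b++); }
}

// threadmatrixsum.cpp keeps only the declaration:
void simplesum(double* a, double* b, std::size_t dim);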

It also hints at what ICC might be doing right here: instead of executing simplesum 1000 times, it interchanges the loops so that it

  1. loads two doubles
  2. adds them 1000 times (or even performs a = a + 1000 * b)
  3. stores two doubles

This increases the arithmetic density without adding any exponentials to the function. But how to prove this? Well, to begin with, let us simply implement this optimization and see what happens! For the analysis, we will look at g++'s performance. Recall our benchmark results:

Thread 1 of 1  took 640msec
Thread 1 of 2  took 1308msec
Thread 2 of 2  took 1304msec
[...]
Thread 1 of 8  took 5294msec
Thread 2 of 8  took 5370msec
Thread 3 of 8  took 5451msec
Thread 4 of 8  took 5527msec
Thread 5 of 8  took 5174msec
Thread 6 of 8  took 5464msec
Thread 7 of 8  took 4640msec
Thread 8 of 8  took 4055msec

Now, let us exchange

for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}

for the version in which the inner loop was made the outer loop:

double* a = pA; double* b = pB;
// each element is loaded once, updated 1000 times in a register,
// and stored once -- 1000 additions per load/store pair
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}

The results show that we are on the right track:

Thread 1 of 1  took 693msec
Thread 1 of 2  took 703msec
Thread 2 of 2  took 700msec
[...]
Thread 1 of 8  took 920msec
Thread 2 of 8  took 804msec
Thread 3 of 8  took 750msec
Thread 4 of 8  took 943msec
Thread 5 of 8  took 909msec
Thread 6 of 8  took 744msec
Thread 7 of 8  took 759msec
Thread 8 of 8  took 904msec

This proves that it is indeed the loop interchange optimization that is the main source of the excellent performance ICC exhibits here.

Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which would improve performance by 200x in the single-threaded and 15x in the 8-threaded case. This is because the numerical instability of repeated addition may cause wildly different results when it is replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization does happen.
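For illustration, a sketch of an integer variant where the folding is legal (my reconstruction, not the exact code used in that test):

#include <cstddef>

// with integers, adding y to x 1000 times is exactly x + 1000*y,
// so the compiler is allowed to collapse the inner loop into a multiply
void simplesum_int(long long* a, long long* b, std::size_t dim){
    for (std::size_t i{0}; i < dim; ++i){
        long long x = a[i], y = b[i];
        for (int n = 0; n < 1000; ++n)
            x += y;
        a[i] = x;
    }
}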

How can we force g++ to perform this optimization, then?

Interestingly enough, the true killer for g++ is not the inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well, but only when the odds are significantly stacked in its favor:

  1. Instead of std::size_t, all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems to be a hard requirement.

  2. Instead of incrementing pointers, index them: a[i] += b[i];

  3. g++ needs to be told -floop-interchange; a simple -O3 is not enough. (A combined sketch follows below.)
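Putting the three criteria together, a reconstruction (mine, not verbatim what was benchmarked) could look like this, with the repetition loop written in the same scope so that the loop nest is visible to the optimizer:

// all bounds are plain int and the arrays are indexed rather than
// walked with pointers; built with: g++ -std=c++11 -O3 -floop-interchange
void simplesum(double* a, double* b, int dim){
    for (int n = 0; n < 1000; ++n)
        for (int i = 0; i < dim; ++i)
            a[i] += b[i];
}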

When all three criteria are met, g++'s performance is similar to what ICC delivered:

Thread 1 of 1  took 714msec
Thread 1 of 2  took 724msec
Thread 2 of 2  took 721msec
[...]
Thread 1 of 8  took 782msec
Thread 2 of 8  took 1221msec
Thread 3 of 8  took 1225msec
Thread 4 of 8  took 781msec
Thread 5 of 8  took 788msec
Thread 6 of 8  took 1262msec
Thread 7 of 8  took 1226msec
Thread 8 of 8  took 820msec

Note: the g++ version used in this experiment was 4.9.0 on x64 Arch Linux.