Why is PPL significantly slower than a sequential loop and OpenMP in this case?

Question

Tho*_*ell 3 c++ performance openmp c++-amp ppl

Following on from my question on Code Review, I am wondering why the PPL implementation of a simple transform of two vectors using std::plus<int> is so much slower than both the sequential std::transform and a for loop parallelized with OpenMP. The timings were: sequential (with vectorization) 25 ms, sequential (without vectorization) 28 ms, C++ AMP 131 ms, PPL 51 ms, OpenMP 24 ms.

I used the following code for profiling, compiled with full optimization in Visual Studio 2013:

#include <amp.h>
#include <ppl.h>        // concurrency::parallel_transform
#include <algorithm>    // std::transform
#include <chrono>
#include <functional>   // std::plus
#include <iostream>
#include <random>
#include <vector>

using namespace concurrency;

const std::size_t size = 30737418;

//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );

    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );

    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }

    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();

    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );

    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

#pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();

    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );

    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

#pragma omp parallel
#pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}


Answer 1

Hri*_*iev 5

According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner. The documentation says of it:

This method of partitioning employs range stealing for load balancing as well as per-iterate cancellation.

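Using this partitioner is overkill for a simple (and memory-bound) operation such as summing two arrays, because its overhead is significant. You should use concurrency::static_partitioner instead. Static partitioning is exactly what most OpenMP implementations use by default when the schedule clause is missing from the for construct.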
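As a minimal sketch of that suggestion, the partitioner can be passed as the last argument of concurrency::parallel_transform. The helper name add_vectors is hypothetical, and the non-MSVC branch is only a sequential fallback so the sketch compiles where ppl.h is unavailable. Whether this closes the gap to OpenMP on a given machine has to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

#if defined(_MSC_VER)
#include <ppl.h>   // concurrency::parallel_transform, concurrency::static_partitioner
#endif

// Hypothetical helper (not from the question): out[i] = a[i] + b[i].
// All three vectors are assumed to have the same size.
void add_vectors(const std::vector<int>& a,
                 const std::vector<int>& b,
                 std::vector<int>& out)
{
    assert(a.size() == b.size() && a.size() == out.size());
#if defined(_MSC_VER)
    // Static partitioning: the range is split into fixed chunks up front,
    // without the range stealing done by the default auto_partitioner.
    concurrency::parallel_transform(a.begin(), a.end(), b.begin(), out.begin(),
                                    std::plus<int>(),
                                    concurrency::static_partitioner());
#else
    // Sequential fallback so that the sketch also builds without PPL.
    std::transform(a.begin(), a.end(), b.begin(), out.begin(), std::plus<int>());
#endif
}
```

In the question's program this would correspond to adding concurrency::static_partitioner() as a sixth argument to the existing parallel_transform call.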

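As already mentioned on Code Review, this is very memory-bound code. It is also the SUM kernel of the STREAM benchmark, which is designed specifically to measure the memory bandwidth of the system it runs on. The a[i] = b[i] + c[i] operation has a very low operational intensity (measured in OPS/byte), and its speed is determined solely by the bandwidth of the main memory bus. That is why the OpenMP code and the vectorized serial code deliver essentially the same performance, which is not much higher than that of the non-vectorized serial code.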
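To put rough numbers on this, the timings from the question can be converted into an effective memory bandwidth. The sketch below assumes a 4-byte int, counts only the two loads and one store per element (12 bytes for one addition, so about 0.08 OPS/byte), and ignores any extra write-allocate traffic. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per element by a[i] = b[i] + c[i]: two loads and one store.
// Assumption: extra write-allocate traffic for the store is ignored.
constexpr std::size_t bytes_per_element = 3 * sizeof(int);

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes)
// for a run over `elements` elements that took `milliseconds`.
double effective_bandwidth_gb_per_s(std::size_t elements, double milliseconds)
{
    assert(milliseconds > 0.0);
    const double bytes = static_cast<double>(elements) * bytes_per_element;
    return bytes / (milliseconds * 1e-3) / 1e9;
}
```

With the question's 30737418 elements, this gives roughly 14.8 GB/s for the 25 ms vectorized sequential run, 15.4 GB/s for the 24 ms OpenMP run, and 7.2 GB/s for the 51 ms PPL run. Under these assumptions a single thread already moves data at nearly the same rate as the OpenMP version, which is consistent with the memory bus being the limit.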

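The way to obtain higher parallel performance is to run the code on a modern multi-socket system and have the data in each array evenly distributed among the sockets. Then you could get a speed-up almost equal to the number of CPU sockets.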
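One common way to get such a distribution is "first-touch" initialization, sketched below. It is not taken from the answer and rests on several assumptions: the operating system places a memory page on the NUMA node of the thread that first writes to it, threads stay pinned to their sockets, and the initialization and compute loops use the same static schedule. Both helper names are hypothetical. Without OpenMP enabled the pragmas are ignored and the code simply runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: allocates n ints without touching them (new int[n]
// leaves them uninitialized), then fills them in a statically scheduled loop
// so that each thread is the first to write its own part of the array.
std::unique_ptr<int[]> make_first_touched(std::ptrdiff_t n, int value)
{
    assert(n >= 0);
    std::unique_ptr<int[]> data(new int[n]);
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        data[i] = value;
    return data;
}

// Same static schedule as above, so each thread mostly reads and writes
// the pages it touched first (assuming threads are not migrated).
void numa_friendly_add(const int* b, const int* c, int* a, std::ptrdiff_t n)
{
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Note that std::vector<int> vec( size ), as used in the question, zero-fills the whole buffer from the constructing thread, so under a first-touch policy all of its pages would likely end up on that thread's node.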
