Parallelizing a for loop gives no performance gain

Question


ela*_*dan 16 c++ winapi tbb parallel-for ppl

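I have an algorithm that converts a Bayer image channel to RGB. In my implementation I have a single nested for loop that iterates over the Bayer channel, computes the RGB index from the Bayer index, and then sets that pixel's value from the Bayer channel. The main thing to notice here is that each pixel can be computed independently of the others (it does not depend on previous computations), so the algorithm is a natural candidate for parallelization. The computation does, however, rely on some preset arrays that all threads access at the same time but do not modify.

However, when I tried to parallelize the main for with MS's concurrency::parallel_for, I got no performance gain. In fact, for an input of size 3264x2540 running on a 4-core CPU, the non-parallelized version runs in ~34 ms and the parallelized version runs in ~69 ms (averaged over 10 runs). I confirmed that the operation really was parallelized (3 new threads were created for the task).

Using Intel's tbb::parallel_for gave nearly identical results. For comparison, I started out with this algorithm implemented in C#, where I also used parallel_for loops, and there I got close to a 4x performance gain (I opted for C++ because for this particular task C++ was faster even on a single core).

Any ideas what is preventing my code from parallelizing well?

My code: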

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
        //Translates index offset in Bayer image to channel offset in RGB image
        int offsets[4];
        //calculate offsets according to color space
        switch (ColorSpace)
        {
        case ColorSpace::BGGR:
            offsets[0] = 2;
            offsets[1] = 1;
            offsets[2] = 1;
            offsets[3] = 0;
            break;
        ...other color spaces
        }
        memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        parallel_for(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row%2)*2 + (col%2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
}


Answer 1

Evg*_*yuk 22

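First of all, your algorithm is memory-bandwidth bound. That is, the memory loads and stores outweigh any index calculations you do.

Vector operations such as SSE/AVX would not help either: you are not doing any intensive computation.

Increasing the amount of work per iteration is also useless. Both PPL and TBB are smart enough not to create a thread per iteration; they use a sensible partitioning scheme that additionally tries to preserve locality. For example, here is a quote from the TBB::parallel_for documentation:

"When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values."

What really matters is reducing memory operations. Any superfluous traversal of the input or output buffer hurts performance, so you should try to remove the memset, or do it in parallel too.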
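One way to drop the separate memset pass is to write all three channels of each pixel while converting, so the output buffer is traversed only once. The following is a minimal sketch under that assumption; `ConvertBayerRowFused` is a hypothetical helper name, not something from the question, and it assumes the same `offsets` table as the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the question: writes all three channels of
// every pixel in one pass, so no separate memset over the output is needed.
// offsets[] maps the Bayer position (row%2)*2 + (col%2) to an RGB channel.
template <typename T>
void ConvertBayerRowFused(const T* bayer, T* rgb, int width, int row,
                          const int (&offsets)[4])
{
    assert(width > 0 && row >= 0);
    const std::size_t rowStart = static_cast<std::size_t>(row) * width;
    for (int col = 0; col < width; ++col)
    {
        const std::size_t bayerIndex = rowStart + col;
        const int channel = offsets[(row % 2) * 2 + (col % 2)];
        T* pixel = rgb + bayerIndex * 3;
        const T value = bayer[bayerIndex];
        // Store the sample in its channel and zero the other two.
        pixel[0] = (channel == 0) ? value : T(0);
        pixel[1] = (channel == 1) ? value : T(0);
        pixel[2] = (channel == 2) ? value : T(0);
    }
}
```

The body of the row lambda passed to parallel_for could call this helper once per row. Whether it actually beats memset plus the sparse write would need to be measured on the target machine.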

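You traverse the input and output data completely. Even if you skip some elements in the output, it does not matter, because on modern hardware memory operations happen in 64-byte chunks. So calculate the size of your input and output, measure the time of the algorithm, divide size/time, and compare the result with the maximum characteristics of your system (for example, measured with a benchmark).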
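The size/time estimate can be sketched as below. The helper names are hypothetical. The byte count is a lower bound that assumes one read of the Bayer input and one write of the 3-channel output; real traffic may be higher, for example if the hardware reads a cache line before partially overwriting it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: minimum bytes the conversion must move, i.e. one read
// of the Bayer input plus one write of the 3-channel RGB output.
inline double MinimumBytesMoved(std::size_t width, std::size_t height,
                                std::size_t bytesPerSample)
{
    const double pixels = static_cast<double>(width) * static_cast<double>(height);
    return pixels * bytesPerSample * (1.0 + 3.0);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured run time.
inline double EffectiveBandwidthGBps(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For the benchmark further down in this answer (3264 x 20320 float samples), this gives about 1.06 GB moved, so a 0.15 s run corresponds to roughly 7.1 GB/s and a 0.09 s run to roughly 11.8 GB/s. Those figures are only meaningful next to the measured peak bandwidth of the same machine.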

I ran a test with Microsoft PPL, Intel TBB, OpenMP, and a native for. The results (I used 8x your height):

Native_For       0.21 s
OpenMP_For       0.15 s
Intel_TBB_For    0.15 s
MS_PPL_For       0.15 s


With the memset removed:

Native_For       0.15 s
OpenMP_For       0.09 s
Intel_TBB_For    0.09 s
MS_PPL_For       0.09 s


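As you can see, memset (which is highly optimized) accounts for a significant share of the execution time, which shows how memory-bound your algorithm is.

Full source code: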

#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}


Answer 2

Dar*_*usz 5

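Synchronization overhead

I would guess that the amount of work done per iteration of the loop is too small. If you split the image into four parts and ran the computation on them in parallel, you would notice a large gain. Try to design the loop so that there are fewer iterations and more work per iteration. The reasoning behind this is that too much synchronization is being done.

Cache usage

An important factor may be how the data is split (partitioned) for processing. If the processed rows are interleaved between threads, as in the bad case below, more rows will cause cache misses. This effect becomes more significant with each additional thread, because the distance between the rows a thread handles grows. If you are certain that the parallelizing function performs reasonable partitioning, then manual work splitting will not give any results.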

 bad       good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
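The "good" layout above can be sketched with plain std::thread, giving each thread one contiguous block of rows. This is only an illustration: `ForEachRowInBlocks` is a hypothetical helper, and a library parallel_for with a sensible partitioner may already behave this way.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: runs rowBody(row) for every row in [0, height), giving
// each of threadCount threads one contiguous block of rows (the "good" layout).
// rowBody must be safe to call concurrently for different rows.
template <typename RowBody>
void ForEachRowInBlocks(int height, int threadCount, RowBody rowBody)
{
    assert(height >= 0 && threadCount > 0);
    const int rowsPerThread = (height + threadCount - 1) / threadCount; // ceiling
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
    {
        const int first = std::min(height, t * rowsPerThread);
        const int last = std::min(height, first + rowsPerThread);
        if (first == last) break; // no rows left for this thread
        workers.emplace_back([first, last, &rowBody]
        {
            for (int row = first; row < last; ++row) rowBody(row);
        });
    }
    for (auto& worker : workers) worker.join();
}
```

A call such as `ForEachRowInBlocks(Height, 4, rowLambda)` with the same row lambda that is passed to parallel_for makes it possible to check whether partitioning is really the bottleneck.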


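Also make sure that you access the data in the way it is aligned; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost every operation either reads a memory segment or writes to it. Preventing cache misses and minimizing memory accesses is crucial.

Code optimizations

The optimizations shown below may already be done by the compiler and may not give better results. It is still worth knowing that they can be done.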

    // is the memset really necessary?
    //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        int rowMod = (row & 1) << 1;
        for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
        {
            auto rgbIndex = tripleBayerIndex + offsets[rowMod];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];

            //unrolled the loop to save the col & 1 operation (assumes Width is even)
            rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
        }
    });


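"Too much synchronization done": I think Intel TBB's [`parallel_for`](http://threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_for_func.htm) is smart enough to do proper partitioning: "However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values." (2 upvotes)
I don't know about concurrency::parallel_for, but TBB does not create a thread per iteration; it divides the work among the number of available cores. At least that is how it looked when I used it with Intel compiler 11. (2 upvotes)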

Archived:	12 years, 9 months ago
Views:	3380
Last activity:	12 years, 9 months ago