使用OpenMP-Tasks的生产者 - 消费者

Question

使用OpenMP-Tasks的生产者 - 消费者

ees*_*_cu 5 c multithreading task openmp

我正在尝试使用OpenMP中的任务实现并行算法.并行编程模式基于生产者 - 消费者的想法,但由于消费者流程比生产者慢,我想使用一些生产者和几个消费者.主要思想是创建与生产者一样多的OS线程,然后每个线程创建要并行完成的任务(由消费者完成).每个生产者都将与相应数量的消费者(即numCheckers/numSeekers)相关联.我在英特尔双芯片服务器上运行该算法,每个芯片有6个内核.问题在于,当我只使用一个生产者(寻求者)和越来越多的消费者(检查员)时,随着消费者数量的增长,性能衰退得很快(见下表),即使正确的核心数量在100%.另一方面,如果我增加生产者的数量,平均时间减少或至少保持稳定,即使消费者数量成比例.在我看来,所有的改进都是通过生产者之间的输入划分来实现的,而且任务只是烦恼.但同样,我对一个生产者的行为没有任何解释.我在OpenMP-Task逻辑中遗漏了什么？难道我做错了什么？

-------------------------------------------------------------------------
|   producers   |   consumers   |   time        |
-------------------------------------------------------------------------
|       1       |       1       |   0.642935    |
|       1       |       2       |   3.004023    |
|       1       |       3       |   5.332524    |
|       1       |       4       |   7.222009    |
|       1       |       5       |   9.472093    |
|       1       |       6       |   10.372389   |
|       1       |       7       |   12.671839   |
|       1       |       8       |   14.631013   |
|       1       |       9       |   14.500603   |
|       1       |      10       |   18.034931   |
|       1       |      11       |   17.835978   |
-------------------------------------------------------------------------
|       2       |       2       |   0.357881    |
|       2       |       4       |   0.361383    |
|       2       |       6       |   0.362556    |
|       2       |       8       |   0.359722    |
|       2       |      10       |   0.358816    |
-------------------------------------------------------------------------

Run Code Online (Sandbox Code Playgroud)

我的代码的主要部分是休闲:

int main( int argc, char** argv) {

  // ... process the input (read from file, etc...)

  const char *buffer_start[numSeekers];
  int buffer_len[numSeekers];

  //populate these arrays dividing the input
  //I need to do this because I need to overlap the buffers for
  //correctness, so I simple parallel-for it's not enough 

  //Here is where I create the producers
  int num = 0;
  #pragma omp parallel for num_threads(numSeekers) reduction(+:num)
  for (int i = 0; i < numSeekers; i++) {
      num += seek(buffer_start[i], buffer_len[i]);
  }

  return (int*)num;
}

int seek(const char* buffer, int n){

  int num = 0;

  //asign the same number of consumers for each producer 
  #pragma omp parallel num_threads(numCheckers/numSeekers) shared(num)
  {
    //only one time for every producer
    #pragma omp single
    {
      for(int pos = 0; pos < n; pos += STEP){
    if (condition(buffer[pos])){
      #pragma omp task shared(num)
      {
        //check() is a sequential function
        num += check(buffer[pos]);
      }
    }
      }
      #pragma omp taskwait
    }
  return num;
}

Run Code Online (Sandbox Code Playgroud)

Answer 1

Hri*_*iev 5

观察到的行为是由于您没有parallel启用嵌套区域.会发生的是,在第一种情况下,您实际上遇到了OpenMP任务的巨大开销.这很可能是因为check()与OpenMP运行时引入的开销相比,没有做足够的工作.为什么它与1和2个生产者一样？

当只使用一个生成器运行时,外部parallel区域仅使用一个线程执行.根据OpenMP API规范,这些parallel区域是非活动的,它们只是串行执行代码(唯一的开销是附加函数调用和通过指针访问共享变量).在这种情况下,内部parallel区域虽然在嵌套并行性被禁用时被嵌套,但是变为活动状态并且刺激了许多任务.任务引入了相对较高的开销,这种开销随着线程数的增加而增加.对于1个消费者,内部parallel区域也是不活动的,因此串行运行而没有任务开销.

当使用两个生成器运行时,外部parallel区域处于活动状态,因此内部parallel区域呈现为非活动状态(请记住 - 未启用嵌套并行性),因此,根本不会创建任何任务 - seek()只需串行运行.没有任务开销,代码运行速度几乎是1生产者/ 1消费者案例的两倍.运行时间不依赖于使用者的数量,因为无论指定了多少个线程,内部parallel区域始终处于非活动状态.

分配变量的任务和协调访问引入的开销有多大？我创建了一个简单的综合基准测试,执行以下代码:

for (int i = 0; i < 10000000; i++) {
   ssum += sin(i*0.001);
}

Run Code Online (Sandbox Code Playgroud)

在Westmere CPU上执行此操作的时间不到一秒,默认优化级别为GCC 4.7.2.然后我使用简单的atomic结构介绍了任务来保护对共享变量的访问ssum:

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         #pragma omp atomic
         ssum += sin(i*0.001);
      }
   }
}

Run Code Online (Sandbox Code Playgroud)

(taskwait这里不需要,因为在parallel区域末端的隐式屏障处有一个调度点)

我还创建了一个更复杂的变体,它以与Massimiliano提出的方式相同的方式执行缩减:

#define STRIDE 8

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         const int idx = omp_get_thread_num();
         ssumt[idx*STRIDE] += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   const int idx = omp_get_thread_num();
   #pragma omp atomic
   ssum += ssumt[idx*STRIDE];
}

Run Code Online (Sandbox Code Playgroud)

代码是用GCC 4.7.2编译的,如:

g++ -fopenmp -o test.exe test.cc

Run Code Online (Sandbox Code Playgroud)

在双插槽Westmere系统(总共12个核心)上以批处理模式运行它(因此没有其他进程可以介入),在插槽上有不同数量的线程和不同的线程放置,一个代码获得以下运行时间:

Configuration   ATOMIC       Reduction    ATOMIC slowdown
2 + 0            2,79 ±0,15   2,74 ±0,19   1,8%
1 + 1            2,72 ±0,21   2,51 ±0,22   8,4% <-----
6 + 0           10,14 ±0,01  10,12 ±0,01   0,2%
3 + 3           22,60 ±0,24  22,69 ±0,33  -0,4%
6 + 6           37,85 ±0,67  38,90 ±0,89  -2,7%

Run Code Online (Sandbox Code Playgroud)

(运行时间以秒为单位给出,测量方式为omp_get_wtime(),平均超过10次运行/标准偏差也显示/; x + y在Configuration列中表示x第一个插座y上的螺纹和第二个插座上的螺纹)

如您所见,任务的开销很大.它远远高于使用atomic而不是将减少应用于线程专用累加器的开销.此外,atomicwith 的赋值部分+=编译为锁定的比较和交换指令(LOCK CMPXCHG) - omp_get_thread_num()每次调用的开销不会高很多.

还应注意,双插槽Westmere系统是NUMA,因为每个CPU都有自己的内存,并且通过QPI链路访问另一个CPU的内存,因此延迟增加(并且带宽可能更低).由于ssum在这种atomic情况下共享变量,在第二个处理器上运行的线程实质上是在发出远程请求.尽管如此,两个代码之间的差异可以忽略不计(除了标记的双线程情况 - 我必须调查原因)并且atomic当线程数越来越高时,代码甚至开始优于具有减少的代码.

On multiscale NUMA systems the synchronisation in the atomic approach might become more of a burden, since it adds locking overhead to the already slower remote accesses. One such system is one of our BCS-coupled nodes. BCS (Bull Coherence Switch) is a proprietary solution from Bull that uses XQPI (eXternal QPI) to link together several Nehalem-EX boards into a single system, introducing three levels of NUMA in the way (local memory; remote memory on the same board; remote memory on a remote board). When running on one such system, consisting of 4 boards with 4 octocore Nehalem-EX CPUs each (128 cores in total), the atomic executable runs for 1036 s (!!), while the reduction approach runs for 1047 s, i.e. both still execute for about the same time (my previous statement that the atomic由于测量期间的OS服务抖动,方法慢了21.5%.这两个数字都来自单次运行,因此不具有代表性.请注意,此系统上的XQPI链路引入非常高延迟的板间QPI消息,并因此锁定是非常昂贵的,但不是说贵.部分开销可以通过使用还原来消除,但必须正确实施.首先,reduce变量的本地副本也应该是线程执行的NUMA节点的本地副本,其次,应该找到一种不进行调用的方法omp_get_thread_num().这两个可以通过许多不同的方式实现,但最简单的方法就是使用threadprivate变量:

static double ssumt;
#pragma omp threadprivate(ssumt)

#pragma omp parallel
{
   ssumt = 0.0;

   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         ssumt += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   #pragma omp atomic
   ssum += ssumt;
}

Run Code Online (Sandbox Code Playgroud)

Access to ssumt needs no protection as two tasks seldom execute simultaneously in the same thread (have to further investigate if this conforms to the OpenMP specs). This version of the code executes for 972 s. Once again this is not that far from 1036 s and comes from a single measurement only (i.e. it could be simply a statistical fluctuation), but in theory, it should be faster.

Lessons to take home:

Read the OpenMP specification about nesting parallel regions. Nested parallelism is enabled either by setting the environment variable OMP_NESTED to true or by calling omp_set_nested(1);. If enabled, the level of active nesting can be controlled by the OMP_MAX_ACTIVE_LEVELS as pointed out by Massimiliano.
留意数据竞赛,并尝试使用最简单的方法来阻止它们.并非每次使用更复杂的方法都能为您带来更好的性能.
特殊系统通常需要特殊编程.
如有疑问,请使用线程检查工具(如果有).英特尔有一个(商业),而Oracle的Solaris Studio(以前称为Sun Studio)也有一个(免费;有Linux版本,尽管有产品名称).
记住开销!尝试以足够大的块分割作业,以便创建数百万个任务的开销不会否定获得的并行增益.

归档时间：	13 年，3 月前
查看次数：	5309 次
最近记录：	7 年，2 月前