OpenMP nested loops

Bil*_*ler 3 c loops nested openmp

I'm just playing around with OpenMP. Look at these code fragments:

#pragma omp parallel
{
    for( i =0;i<n;i++)
    {
        /* do something */
    }
}

for( i =0;i<n;i++)
{
  #pragma omp parallel
  {
     /* do something */
  }
}

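Why is the first one so much slower (about 5 times) than the second? In theory, I thought the first had to be faster, because the parallel region is created only once rather than n times as in the second. Can someone explain this to me?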

The code I want to parallelize has the following structure:

for(i=0;i<n;i++) // won't be parallelizable
{
  for(j=i+1;j<n;j++)  // will be parallelized
  {
    /* do something */
  }

  for(j=i+1;j<n;j++)  // will be parallelized
    for(k = i+1;k<n;k++)
    {
      /* do something */
    }

}

I wrote a simple program to measure the time and reproduce my results.

#include <stdio.h>
#include <omp.h>

void test( int n)
{
  int i ;
  double t_a = 0.0, t_b = 0.0 ;


  t_a = omp_get_wtime() ;

  #pragma omp parallel
  {
    for(i=0;i<n;i++)
    {

    }
  }

  t_b = omp_get_wtime() ;

  for(i=0;i<n;i++)
  {
    #pragma omp parallel
    {
    }
  }

  printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
  printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
}

int main(void)
{
  int i, n   ;
  double t_1 = 0.0, t_2 = 0.0 ;

  printf( "n: " ) ;
  scanf( "%d", &n ) ;

  t_1 = omp_get_wtime() ;

  #pragma omp parallel
  {
    for(i=0;i<n;i++)
    {

    }
  }

  t_2 = omp_get_wtime() ;

  for(i=0;i<n;i++)
  {
    #pragma omp parallel
    {
    }
  }

  printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
  printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;

  test(n) ;

  return 0 ;
}

If I run it with different values of n, I always get different results.

n: 30000
directive outside for-loop: 0.881884
directive inside for-loop: 0.073054 
directive outside for-loop: 0.049098
directive inside for-loop: 0.011663 

n: 30000
directive outside for-loop: 0.402774
directive inside for-loop: 0.071588 
directive outside for-loop: 0.049168
directive inside for-loop: 0.012013 

n: 30000
directive outside for-loop: 2.198740
directive inside for-loop: 0.065301 
directive outside for-loop: 0.047911
directive inside for-loop: 0.012152 



n: 1000
directive outside for-loop: 0.355841
directive inside for-loop: 0.079480 
directive outside for-loop: 0.013549
directive inside for-loop: 0.012362 

n: 10000
directive outside for-loop: 0.926234
directive inside for-loop: 0.071098 
directive outside for-loop: 0.023536
directive inside for-loop: 0.012222 

n: 10000
directive outside for-loop: 0.354025
directive inside for-loop: 0.073542 
directive outside for-loop: 0.023607
directive inside for-loop: 0.012292 

How do you explain this difference?!

Results with your version:

Input n: 1000
[2] directive outside for-loop: 0.331396
[2] directive inside for-loop: 0.002864 
[2] directive outside for-loop: 0.011663
[2] directive inside for-loop: 0.001188 
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327 
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048 
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188 
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257 

osg*_*sgx 5

"because the parallel region is created only once rather than n times as in the second?"

Kind of. The construct

#pragma omp parallel
{
}

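also means handing work items to the threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. In addition, waiting threads are by default put to sleep by the OS, and waking a thread up takes some time.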
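The cost described above can be estimated with a small timing helper. This is only a sketch: `fork_join_cost` and `now_seconds` are hypothetical names that do not appear in the question, the `volatile` store is there only to make it less likely that the compiler drops the otherwise empty region, and the numbers will vary with the compiler, the runtime and the machine.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Seconds from some fixed point; uses clock() when built without OpenMP. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: average time in seconds to enter and leave
   one (nearly) empty parallel region. */
double fork_join_cost(int repeats)
{
    assert(repeats > 0);
    double start = now_seconds();
    for (int r = 0; r < repeats; r++) {
        #pragma omp parallel
        {
            /* Trivial per-thread store so the region body is not empty. */
            volatile int sink = 0;
            (void)sink;
        }
    }
    return (now_seconds() - start) / repeats;
}
```

Calling `fork_join_cost(10000)` from `main` in a build with OpenMP enabled (for example `-fopenmp` with GCC) and printing the result gives a rough per-region cost to compare against the time spent in the loop body.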

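As for your middle sample: you can try moving the parallel region outside the outer for and work-sharing only the inner loops, like this: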

#pragma omp parallel private(i, k)
{
for(i=0;i<n;i++) // won't be parallelized: every thread runs all iterations
{
  #pragma omp for
  for(j=i+1;j<n;j++)  // will be parallelized
  {
    /* do something */
  }
  #pragma omp for
  for(j=i+1;j<n;j++)  // will be parallelized
    for(k = i+1;k<n;k++)
    {
      /* do something */
    }
  // Is there really nothing else to do? If there is, use:
  // won't be parallelized
  #pragma omp single
  { // sequential part of the outer loop
      printf("Progress... %i\n", i); fflush(stdout);
  }

  // Here is the point: every thread runs the outer loop, but...
  // (omp for and omp single already end with an implicit barrier;
  //  the explicit one just makes the synchronization point visible)
  #pragma omp barrier

  //  all loop iterations are synchronized:
  //       thr0   thr1  thr2
  // i      0      0     0
  //     ----   barrier ----
  // i      1      1     1
  //     ----   barrier ----
  // i      2      2     2
  // and so on
}
}

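In general, placing the parallelism at the highest (outermost) possible for of a loop nest is better than placing it on the inner loops. If you need some code to run sequentially, use the more advanced pragmas (such as omp barrier, omp master or omp single) or omp_locks for that code. Any of these will be faster than starting omp parallel many times.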
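A minimal compilable sketch of that pattern, with one parallel region around a sequential outer loop and a work-shared inner loop, might look like the following. The function `cascade_add` and its arithmetic are made up for illustration and are not from the question; the only point is that iteration i reads `a[i]`, which must be final before the threads move on, and the implicit barrier at the end of `omp for` guarantees that.

```c
#include <assert.h>

/* Hypothetical example: for each i in order, add a[i] to every later element.
   The outer loop carries a dependency, so only the inner loop is shared. */
void cascade_add(long long *a, int n)
{
    assert(n >= 0);
    #pragma omp parallel
    {
        /* i is declared inside the region, so every thread has its own copy
           and every thread walks through all outer iterations. */
        for (int i = 0; i < n; i++) {
            /* The iterations of j are divided among the threads. */
            #pragma omp for
            for (int j = i + 1; j < n; j++) {
                a[j] += a[i];
            }
            /* Implicit barrier at the end of omp for: a[i + 1] is final
               before any thread starts outer iteration i + 1. */
        }
    }
}
```

The threads are forked once for the whole nest instead of once per outer iteration. Built without OpenMP support, the pragmas are ignored and the function computes the same values serially.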