Bil*_*ler 3 c loops nested openmp
Just playing around with OpenMP. Take a look at these code snippets:
#pragma omp parallel
{
    for (i = 0; i < n; i++)
    {
        /* doing something */
    }
}
and
for (i = 0; i < n; i++)
{
    #pragma omp parallel
    {
        /* doing something */
    }
}
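Why is the first one so much slower (by a factor of about 5) than the second one? In theory I thought the first one had to be faster, because the parallel region is created only once and not n times as in the second one. Can someone explain this to me?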
The code I want to parallelize has the following structure:
for (i = 0; i < n; i++)              // won't be parallelizable
{
    for (j = i + 1; j < n; j++)      // will be parallelized
    {
        /* doing sth. */
    }
    for (j = i + 1; j < n; j++)      // will be parallelized
        for (k = i + 1; k < n; k++)
        {
            /* doing sth. */
        }
}
I made a simple program to measure the time and reproduce my results.
#include <stdio.h>
#include <omp.h>

void test( int n)
{
    int i ;
    double t_a = 0.0, t_b = 0.0 ;

    t_a = omp_get_wtime() ;
    #pragma omp parallel
    {
        for(i=0;i<n;i++)
        {
        }
    }

    t_b = omp_get_wtime() ;
    for(i=0;i<n;i++)
    {
        #pragma omp parallel
        {
        }
    }

    printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
    printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
}

int main(void)
{
    int i, n ;
    double t_1 = 0.0, t_2 = 0.0 ;

    printf( "n: " ) ;
    scanf( "%d", &n ) ;

    t_1 = omp_get_wtime() ;
    #pragma omp parallel
    {
        for(i=0;i<n;i++)
        {
        }
    }

    t_2 = omp_get_wtime() ;
    for(i=0;i<n;i++)
    {
        #pragma omp parallel
        {
        }
    }

    printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
    printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;

    test(n) ;
    return 0 ;
}
If I start it with different values of n, I always get different results.
n: 30000
directive outside for-loop: 0.881884
directive inside for-loop: 0.073054
directive outside for-loop: 0.049098
directive inside for-loop: 0.011663
n: 30000
directive outside for-loop: 0.402774
directive inside for-loop: 0.071588
directive outside for-loop: 0.049168
directive inside for-loop: 0.012013
n: 30000
directive outside for-loop: 2.198740
directive inside for-loop: 0.065301
directive outside for-loop: 0.047911
directive inside for-loop: 0.012152
n: 1000
directive outside for-loop: 0.355841
directive inside for-loop: 0.079480
directive outside for-loop: 0.013549
directive inside for-loop: 0.012362
n: 10000
directive outside for-loop: 0.926234
directive inside for-loop: 0.071098
directive outside for-loop: 0.023536
directive inside for-loop: 0.012222
n: 10000
directive outside for-loop: 0.354025
directive inside for-loop: 0.073542
directive outside for-loop: 0.023607
directive inside for-loop: 0.012292
How do you explain this difference?!
Results with your version:
Input n: 1000
[2] directive outside for-loop: 0.331396
[2] directive inside for-loop: 0.002864
[2] directive outside for-loop: 0.011663
[2] directive inside for-loop: 0.001188
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257
because the parallel region is created only once and not n times as in the second one?
Kind of. The construct
#pragma omp parallel
{
}
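also means distributing work items to the threads at the '{' and returning the threads to the thread pool at the '}'. That involves a lot of thread-to-thread communication. In addition, by default waiting threads are put to sleep by the OS, and waking a thread up takes some time.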
About your middle sample: you can try to limit the parallelism of the outer for with...
#pragma omp parallel private(i,k)
{
    for (i = 0; i < n; i++)              // won't be parallelized: every thread runs all iterations
    {
        #pragma omp for
        for (j = i + 1; j < n; j++)      // will be parallelized
        {
            /* doing sth. */
        }
        #pragma omp for
        for (j = i + 1; j < n; j++)      // will be parallelized
            for (k = i + 1; k < n; k++)
            {
                /* doing sth. */
            }
        // Is there really nothing else? If there is, use:
        // won't be parallelized
        #pragma omp single
        {   // sequential part of the outer loop
            printf("Progress... %i\n", i); fflush(stdout);
        }
        // Here is the point: every thread runs the outer loop, but...
        // (omp for and omp single already end with an implicit barrier,
        //  so this explicit one is redundant but harmless)
        #pragma omp barrier
        // all loop iterations are synchronized:
        //       thr0   thr1   thr2
        // i      0      0      0
        //     ---- barrier ----
        // i      1      1      1
        //     ---- barrier ----
        // i      2      2      2
        // and so on
    }
}
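In general, placing the parallelism at the highest (outermost) possible for of a for nest is better than placing it on the inner loops. If you need some code to execute sequentially, use the advanced pragmas (like omp barrier, omp master or omp single) or omp_locks for that code. Any of these will be faster than starting omp parallel many times.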