Why does OpenMP outperform threads?

Question

I have been calling this with OpenMP:

#pragma omp parallel for num_threads(totalThreads)
for(unsigned i=0; i<totalThreads; i++)
{
    workOnTheseEdges(startIndex[i], endIndex[i]);
}


And this with C++11 std::thread (which I believe is just pthreads underneath):

vector<thread> threads;
for(unsigned i=0; i<totalThreads; i++)
{
    threads.push_back(thread(workOnTheseEdges, startIndex[i], endIndex[i]));
}
for (auto& t : threads)
{
    t.join();
}


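However, the OpenMP implementation is twice as fast! I would have expected C++11 threads to be faster, since they are lower-level. Note: the code above is not called just once; it may be called 10,000 times in a loop, so maybe that has something to do with it?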

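Edit: To clarify, in practice I use either the OpenMP version or the C++11 version, not both at the same time. With the OpenMP code the run takes 45 seconds; with the C++11 code it takes 100 seconds.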

Answer 1

adp*_*mbo 5

Where does totalThreads come from in your OpenMP version? I bet it is not startIndex.size().

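The OpenMP version queues the requests onto totalThreads worker threads. The C++11 version appears to create startIndex.size() threads, which involves a lot of overhead if that number is large.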
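To illustrate the difference between one thread per range and a fixed number of workers, here is a minimal sketch in plain C++11. The helper name runOnFixedWorkers and its parameters are hypothetical and do not come from the question. It assumes the per-range work can be expressed as a callable taking the range index. It is not how OpenMP schedules work internally, only one way to cap the thread count by hand.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): calls work(i) for every i in
// [0, numRanges) using at most numWorkers threads, instead of one thread per range.
void runOnFixedWorkers(std::size_t numRanges, unsigned numWorkers,
                       const std::function<void(std::size_t)>& work)
{
    assert(static_cast<bool>(work));  // an empty std::function would throw when called

    // Never start more threads than there are ranges, and always at least one.
    std::size_t count = std::min<std::size_t>(numWorkers, numRanges);
    if (count == 0) count = 1;

    // Shared counter: each worker atomically claims the next unprocessed index.
    std::atomic<std::size_t> next{0};

    std::vector<std::thread> workers;
    workers.reserve(count);
    for (std::size_t w = 0; w < count; ++w) {
        workers.emplace_back([&next, &work, numRanges] {
            for (std::size_t i = next.fetch_add(1); i < numRanges; i = next.fetch_add(1)) {
                work(i);
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

With this sketch, a call such as runOnFixedWorkers(startIndex.size(), totalThreads, [&](std::size_t i) { workOnTheseEdges(startIndex[i], endIndex[i]); }) would start at most totalThreads threads regardless of how many ranges exist. It still creates and joins those threads on every call, so it does not remove the per-call creation cost.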

Answer 2

use*_*666 3

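Consider the following code. The OpenMP version runs in 0 seconds, while the C++11 version takes 50 seconds. This is not because the function does nothing, and it is not because the vector is declared inside the loop. As you can imagine, the C++11 threads are created and then destroyed in every iteration. OpenMP, on the other hand, actually implements a thread pool. That is not in the standard, but it is in the Intel and AMD implementations.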

for(int j=1; j<100000; ++j)
{
    if(algorithmToRun == 1)
    {
        vector<thread> threads;
        for(int i=0; i<16; i++)
        {
            threads.push_back(thread(doNothing));
        }
        for(auto& thread : threads) thread.join();
    }
    else if(algorithmToRun == 2)
    {
        #pragma omp parallel for num_threads(16)
        for(unsigned i=0; i<16; i++)
        {
            doNothing();
        }
    }
}
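The thread-reuse idea described above can be approximated in plain C++11 by creating the threads once and handing them a new task on each iteration. The following is a minimal sketch under that assumption. The class name WorkerTeam and its interface are hypothetical, and this is not a description of how any OpenMP runtime is actually implemented.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: a fixed team of threads created once and reused for many rounds.
class WorkerTeam {
public:
    explicit WorkerTeam(unsigned n) {
        for (unsigned id = 0; id < n; ++id) {
            threads_.emplace_back([this, id] { loop(id); });
        }
    }

    ~WorkerTeam() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Runs task(id) once on every worker and blocks until all of them have finished.
    void run(const std::function<void(unsigned)>& task) {
        assert(static_cast<bool>(task));
        std::unique_lock<std::mutex> lock(m_);
        task_ = &task;
        pending_ = threads_.size();
        ++generation_;  // signals "a new round is available"
        wake_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void loop(unsigned id) {
        std::size_t seen = 0;  // last round this worker has executed
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            wake_.wait(lock, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            const std::function<void(unsigned)>* task = task_;
            lock.unlock();  // run the task without holding the lock
            (*task)(id);
            lock.lock();
            if (--pending_ == 0) done_.notify_one();
        }
    }

    std::mutex m_;
    std::condition_variable wake_, done_;
    const std::function<void(unsigned)>* task_ = nullptr;
    std::size_t generation_ = 0;
    std::size_t pending_ = 0;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};
```

In the benchmark loop above, one would construct WorkerTeam team(16) before the outer loop and call team.run(...) inside it, so the 16 threads are created once rather than 100,000 times. Each round then costs a wake-up and a wait instead of thread creation and destruction.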

