如何有效地并行化分而治之算法？

Question

如何有效地并行化分而治之算法？

ere*_*eOn 9 c++ sorting parallel-processing multithreading c++11

过去几天我一直在清理我对排序算法的记忆,而且我遇到了一个我无法找到最佳解决方案的情况.

我写了一个quicksort的基本实现,我想通过并行执行来提高性能.

我得到的是:

template <typename IteratorType>
void quicksort(IteratorType begin, IteratorType end)
{
  if (distance(begin, end) > 1)
  {
    const IteratorType pivot = partition(begin, end);

    if (distance(begin, end) > 10000)
    {
      thread t1([&begin, &pivot](){ quicksort(begin, pivot); });
      thread t2([&pivot, &end](){ quicksort(pivot + 1, end); });

      t1.join();
      t2.join();
    }
  }
}

Run Code Online (Sandbox Code Playgroud)

虽然这比天真的"无线程"实现更好,但这具有严重的局限性,即:

如果要排序的数组太大或者递归太深,系统可能会用尽线程并且执行失败.
可能可以避免在每个递归调用中创建线程的成本,特别是考虑到线程不是无限资源.

我想使用线程池来避免后期线程创建,但我面临另一个问题:

我创建的大多数线程最初都是完成所有工作,然后在等待完成时不做任何事情.这导致很多线程只是等待子调用完成,这似乎是次优的.

是否有一种技术/实体可以用来避免浪费线程(允许重用)？

我可以使用boost或任何C++ 11工具.

Answer 1

And*_*zos 6

如果要排序的数组太大或者递归太深,系统可能会用尽线程并且执行失败.

所以在最大深度后顺序...

template <typename IteratorType>
void quicksort(IteratorType begin, IteratorType end, int depth = 0)
{
  if (distance(begin, end) > 1)
  {
    const IteratorType pivot = partition(begin, end);

    if (distance(begin, end) > 10000)
    {
      if (depth < 5) // <--- HERE
      { // PARALLEL
        thread t1([&begin, &pivot](){ quicksort(begin, pivot, depth+1); });
        thread t2([&pivot, &end](){ quicksort(pivot + 1, end, depth+1); });

        t1.join();
        t2.join();
      }
      else
      { // SEQUENTIAL
        quicksort(begin, pivot, depth+1);
        quicksort(pivot + 1, end, depth+1);
      }
    }
  }
}

Run Code Online (Sandbox Code Playgroud)

随着depth < 5进一步的并行性将产生任何好处-它最大〜50个线程,这将很容易饱和大多数多核CPU将创建.

可能可以避免在每个递归调用中创建线程的成本,特别是考虑到线程不是无限资源.

睡眠线程的成本并不像人们想象的那么多,但是在每个分支上创建两个新线程没有意义,可以重用当前线程,而不是让它睡觉......

template <typename IteratorType>
void quicksort(IteratorType begin, IteratorType end, int depth = 0)
{
  if (distance(begin, end) > 1)
  {
    const IteratorType pivot = partition(begin, end);

    if (distance(begin, end) > 10000)
    {
      if (depth < 5)
      {
        thread t1([&begin, &pivot](){ quicksort(begin, pivot, depth+1); });
        quicksort(pivot + 1, end, depth+1);   // <--- HERE

        t1.join();
      } else {
        quicksort(begin, pivot, depth+1);
        quicksort(pivot + 1, end, depth+1);
      }
    }
  }
}

Run Code Online (Sandbox Code Playgroud)

作为使用的替代方法depth,您可以设置全局线程限制,然后仅在未达到限制时创建新线程(如果有),而不是按顺序执行.这个线程限制可以是进程范围的,因此对quicksort的并行调用将从创建太多线程的协同退避.

@ user1131467如果你正在处理纯粹的平面数据并行性,那没关系,但这不是我们在这里处理的,这不是我所说的.使用这种方法编写递归/嵌套数据并行算法很差,并且是一种效率低下的方法,众所周知,它存在各种问题,过度订阅,负载均衡不佳等等.所以是的,你可以在这种情况下做得更好并且它不是"魔术"(我发现这是非常侮辱性的)有关于这个主题的各种论文和实现,当前领先的一个是工作窃取算法. (2认同)
@ user1131467但这种方法不是最优的!(你发布的代码)并且不能很好地扩展,这是我的观点!OP甚至知道这一点.线程池仅有助于降低线程创建的成本,但您仍然使用可变数量的线程,具体取决于递归的宽度/深度.如果您使用基于作业/任务的系统,则无论生成多少任务,总会有固定数量的工作线程.线程计数是固定的,通常等于硬件线程的数量.我在游戏行业工作,我们从不编写代码中的并行算法. (2认同)

归档时间：	12 年，6 月前
查看次数：	2415 次
最近记录：	11 年，5 月前