Orthogonal Recursive Bisection in Chapel (Barnes-Hut algorithm)

Rok*_*sel 5 parallel-processing chapel

I am implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I have already implemented sequential and shared-memory versions, which are available on GitHub.

I am following the algorithm outlined here (Chapter 7):

  1. Perform orthogonal recursive bisection (ORB) and distribute the bodies so that each process has an equal amount of work
  2. Construct a locally essential tree on each process
  3. Compute the forces and advance the bodies
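
In Chapel terms, the overall structure I am picturing is roughly the skeleton below (just a sketch, all names are placeholders; step 1 and the communication it requires are exactly what I don't know how to express):

coforall loc in Locales do on loc {
  // 1. ORB: repeatedly bisect the domain and exchange bodies so that
  //    every locale ends up with an equal share of the work
  // 2. build the locally essential tree from local + imported bodies
  // 3. compute forces on the local bodies and advance them
}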

I have a good idea of how to implement this in C/MPI: MPI_Allreduce for the bisection, and simple message passing for the communication between processes (for the body transfers). MPI_Comm_split is also a very handy function that lets me split the processes at each step of the ORB.

I am having some trouble performing the ORB using the parallel/distributed constructs that Chapel provides. I need some way to aggregate (reduce) work across processes (locales in Chapel), to split the processes into groups, and to do inter-process communication to transfer bodies.

I would be grateful for any advice on how to implement this in Chapel. If another approach works better in Chapel, that would be great as well.

Rok*_*sel 2

After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel

I was not able to use much of the fancy parallel machinery Chapel provides. I relied only on block-distributed arrays with sync elements, and I essentially replicated the SPMD model.
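
To make that concrete, the building block is a block-distributed array paired with a sync array whose full/empty state does the synchronization. A minimal sketch (illustrative only, not code from the repository):

use BlockDist;

const localeSpace  = {0..#numLocales};
const localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var data: [localeDomain] real;       // one data slot per locale
var data$: [localeDomain] sync int;  // full/empty flag guarding each slot

// Writing a sync element fills it; reading it empties it and blocks until
// it is full. The pattern shown further below is built on exactly this.
data[0] = 3.14;
data$[0] = here.id;    // mark slot 0 as full
var reader = data$[0]; // blocks until full, then reads and empties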

In main.chpl I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of bodies and the previously mentioned arrays. Worker.chpl contains the bulk of the algorithm.
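
Schematically, the setup looks something like the following (a hedged sketch with placeholder names, not the actual declarations from main.chpl):

use BlockDist;

record Body { var pos, vel: 3*real; var mass: real; }

config const totalBodies = 1024;

// bodies are block-distributed, so each locale owns an equal chunk
const bodyDom = {0..#totalBodies} dmapped Block(boundingBox={0..#totalBodies});
var bodies: [bodyDom] Body;

proc worker(rank: int, myDom) {
  writeln("worker ", rank, " owns ", myDom.size, " bodies");
  // ... ORB, locally essential tree, force computation ...
}

// SPMD-style launch: one worker per locale, each handed its own share
coforall loc in Locales do on loc {
  // each worker gets only the indices of the bodies its locale owns
  worker(here.id, bodies.localSubdomain());
}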

I replaced the MPI_Comm_split functionality with a custom function determineGroupPartners where I do the same thing manually (a rough sketch of the idea appears after the pattern overview below). As for MPI_Allreduce, I found a nice little pattern I could use everywhere:

var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the rank of the intended receiver for each slot
var rank = here.id;

for i in 0..#(numLocales-1) { // one exchange round per other locale
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Drain the -1 sentinel left over from the previous round
    // (no-op on the first round) so the slot is empty before writing
    if (arr$[rank].isFull) {
        arr$[rank]; // readFE: empties the slot
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Am I the intended receiver?
    while (arr$[partner].readFF() != rank) {}

    // Read partner value
    var partnerValue = arr[partner];

    // Do something with partner value

    arr$[partner]; // empty

    // Reset: this write blocks until arr$[rank] is empty, i.e. until
    // the intended receiver has consumed the value
    arr$[rank] = -1;
}

This is a somewhat complicated way of implementing FIFO channels (see Julia's RemoteChannel, which inspired this "pattern"). Overview:

  • First, each locale calculates its intended receiver and its partner (the locale it will read a value from)

  • The locale checks whether the previous value has been read

  • The locale stores a new value and "locks" it by setting arr$[rank] to the intended receiver's rank

  • The locale waits until its partner has stored its value and set the appropriate intended receiver

  • Once the locale is the intended receiver, it reads the partner's value and does some operation on it

  • Then the locale empties/unlocks the slot by reading arr$[partner]

  • Finally, it resets arr$[rank] by writing -1. Because this write blocks until the slot is empty, it also ensures that the value the locale stored was read by the intended receiver
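
For reference, the MPI_Comm_split replacement boils down to deciding, for a given ORB level, which group a rank belongs to and which ranks share that group. A rough sketch of the idea (placeholder names, not the actual determineGroupPartners code):

// At ORB level `level`, the ranks are split into 2**level equal groups.
// Returns the group id and the ranks belonging to the same group.
proc groupPartners(rank: int, level: int, numRanks: int) {
  const groupSize = numRanks / 2**level;  // ranks per group at this level
  const groupId   = rank / groupSize;     // which group this rank falls into
  const first     = groupId * groupSize;  // lowest rank in the group
  return (groupId, first..#groupSize);
}

// e.g. with 8 locales at level 1, rank 5 is in group 1 together with ranks 4..7
writeln(groupPartners(5, 1, 8));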

I realize that this might be an overly complicated solution for this problem. There probably exists a better algorithm that would fit Chapel's global view of parallel computation; the algorithm I implemented lends itself more to the SPMD model.
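
For instance, the reductions that the ring pattern implements by hand can be expressed as single global-view operations over a distributed array (an illustration of the language feature, not something the linked code does):

use BlockDist;

const D = {0..#4096} dmapped Block(boundingBox={0..#4096});
var masses: [D] real = 1.0;

// the runtime handles all cross-locale communication behind this expression
const totalMass = + reduce masses;
writeln(totalMass);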

That being said, I think that Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows, the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with more than 4 nodes, but I think Chapel will end up with respectable numbers.