Asynchronous finite difference scheme using MPI_Put

Asked by k1n*_*ext. Tags: c, parallel-processing, asynchronous, mpi, mpi-rma

A paper by Donzis & Aditya shows that it is possible to use finite difference schemes that tolerate a delay on the stencil. What does that mean? An FD scheme may be used to solve the heat equation, and it reads (simplified):

u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])

That is, the value at the next time step depends on the value at the same position and on its neighbours, all taken from the previous time step.
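To make the data dependence explicit, one serial time step might look like the following minimal sketch (the function and array names u_old, u_new are mine, for illustration only):

/* one serial time step of the scheme above; u_old holds time level t,
   u_new receives time level t+1, n is the number of grid points */
void step(const double *u_old, double *u_new, int n, double c)
{
    for (int i = 1; i < n - 1; i++)
        u_new[i] = u_old[i] + c * (u_old[i-1] - u_old[i+1]);
}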

The problem is easily parallelized by splitting the (in our example 1D) domain across the processors. However, when computing a processor's boundary nodes we need communication, because the element u[t,i-1] or u[t,i+1] is only available on another processor.
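Concretely, in the code fragments below each rank stores NpP interior points plus two ghost cells; a sketch of the assumed layout (my reconstruction, matching the names used further down):

/* assumed per-rank storage: NpP interior points plus two ghost cells */
double *u = malloc((NpP + 2) * sizeof(double));
/* u[1] .. u[NpP] : points owned by this rank
   u[0]           : ghost cell, copy of the left neighbour's u[NpP]
   u[NpP+1]       : ghost cell, copy of the right neighbour's u[1]  */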

The problem is illustrated in the following figure, which is taken from the cited paper.

[Figure omitted: illustration of the boundary exchange, taken from the cited paper.]

An MPI implementation might use MPI_Send and MPI_Recv to compute synchronously. Since the computation itself is fairly cheap, the communication can become a bottleneck.

The paper offers a solution to this problem:

Instead of synchronizing the processes, just take whatever boundary nodes are available, even though they may be values from an earlier time step. The method then still converges (under certain assumptions).
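In the notation of the scheme above, this amounts to allowing a delay d >= 0 on the neighbour value that comes from another processor; for example, at a left process boundary the update becomes (my paraphrase, not the paper's exact notation):

u[t+1,i] = u[t,i] + c (u[t-d,i-1] - u[t,i+1])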

For my work I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv works fine. I extended the memory by two elements, used as ghost cells for the neighbouring values, and exchange the needed values with sends and receives. The code below is basically an implementation of the figure above and is executed in every time step, before the computation:

/* send my rightmost value to the right neighbour,
   receive my left ghost cell from the left neighbour */
MPI_Send(&u[NpP], 1, MPI_DOUBLE, RIGHT, rank, MPI_COMM_WORLD);
MPI_Recv(&u[0],   1, MPI_DOUBLE, LEFT,  LEFT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* send my leftmost value to the left neighbour,
   receive my right ghost cell from the right neighbour */
MPI_Send(&u[1],     1, MPI_DOUBLE, LEFT,  rank,  MPI_COMM_WORLD);
MPI_Recv(&u[NpP+1], 1, MPI_DOUBLE, RIGHT, RIGHT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
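(The same exchange can also be written with paired MPI_Sendrecv calls, which avoids having to order the blocking sends and receives by hand; the full program further down uses this form. A sketch with the same names:)

/* exchange to the right: send u[NpP] rightwards, receive u[0] from the left */
MPI_Sendrecv(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0,
             &u[0],   1, MPI_DOUBLE, LEFT,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* exchange to the left: send u[1] leftwards, receive u[NpP+1] from the right */
MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, LEFT,  1,
             &u[NpP+1], 1, MPI_DOUBLE, RIGHT, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);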

Now, I am by no means an MPI expert. I figured that MPI_Put might be what I need for the asynchronous case, and after reading a little I came up with the following implementation.

Before the time loop:

MPI_Win win;
double *boundary;
/* two-element receive buffer for the ghost values written by the
   neighbours: boundary[0] is filled by the left neighbour,
   boundary[1] by the right neighbour */
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);
MPI_Info info;
MPI_Info_create(&info);
/* "no_locks" asserts that MPI_Win_lock will never be used on this
   window (note that this rules out the passive-target, lock/unlock
   approach used in the answer below) */
MPI_Info_set(info, "no_locks", "true");
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);

Inside the time loop:

/* write my boundary values into the neighbours' windows:
   my u[1] becomes the left neighbour's boundary[1],
   my u[NpP] becomes the right neighbour's boundary[0] */
MPI_Put(&u[1],   1, MPI_DOUBLE, LEFT,  1, 1, MPI_DOUBLE, win);
MPI_Put(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0, 1, MPI_DOUBLE, win);
MPI_Win_fence(0, win);
/* copy the received ghost values into the working array */
u[0]     = boundary[0];
u[NpP+1] = boundary[1];

This puts the needed elements into the window, namely into boundary (an array with two elements) on the neighbouring processors, and u[0] and u[NpP+1] then take their values from the local boundary array. This implementation works, and I get the same results as with MPI_Send/Recv. However, it is not really asynchronous, since I am still using MPI_Win_fence, which, as far as I understand, enforces synchronization.

The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and keep their initial values. My expectation was that without MPI_Win_fence you would simply take whatever value happens to be in boundary, which may (or may not) have been updated by the neighbouring processor.

Does anybody have an idea how to avoid using MPI_Win_fence while also solving the problem that the values inside boundary are never updated?

I am also not sure whether the code I provided is enough to understand my problem or to give any hints. If that is the case, feel free to ask, as I will try to add all the missing parts.

Answered by Jon*_*rsi:

The following seems to work for me, in the sense of executing correctly; it is a small 1D heat equation taken from one of our tutorials, extended with the RMA pieces:

/* push my first interior point into the left neighbour's right
   ghost cell (exposed through its rightwin) */
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );

/* push my last interior point into the right neighbour's left
   ghost cell (exposed through its leftwin) */
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );

/* read my own ghost cells, locking the local windows so the loads
   do not race with incoming puts */
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0]           = *leftgc;
MPI_Win_unlock( rank, leftwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );

In the code I even have the even-numbered ranks wait an extra 10 ms each time step, to make sure things get out of sync; but from the traces, everything still looks remarkably synchronized. I don't know whether that degree of synchronization can be fixed by tweaking the code, whether it is a limitation of the implementation (IntelMPI 5.0.1), or whether it simply happens because the time spent in computation is so small that communication dominates (although as to the last point, cranking up the usleep interval did not seem to have any effect). The full program:

#define _BSD_SOURCE     /* usleep */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>


int main(int argc, char **argv) {
    /* simulation parameters */
    const int totpoints=1000;
    int locpoints;
    const float xleft = -12., xright = +12.;
    float locxleft, locxright;
    const float kappa = 1.;

    const int nsteps=100;

    /* data structures */
    float *x;
    float **temperature;

    /* parameters of the original temperature distribution */
    const float ao=1., sigmao=1.;

    float fixedlefttemp, fixedrighttemp;

    int current, new;
    int step, i;
    float time;
    float dt, dx;
    float rms;

    int rank, size;
    int start,end;
    int left, right;
    int lefttag=1, righttag=2;

    /* MPI Initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);

    locpoints = totpoints/size;
    start = rank*locpoints;
    end   = (rank+1)*locpoints - 1;
    if (rank == size-1)
        end = totpoints-1;
    locpoints = end-start+1;

    left = rank-1;
    if (left < 0) left = MPI_PROC_NULL;
    right= rank+1;
    if (right >= size) right = MPI_PROC_NULL;

    #ifdef ONESIDED
    if (rank == 0)
        printf("Onesided: Allocating windows\n");
    MPI_Win leftwin, rightwin;
    float *leftgc, *rightgc;
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc,  &leftwin);
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
    #endif
    /* set parameters */

    dx = (xright-xleft)/(totpoints-1);
    dt = dx*dx * kappa/10.;

    locxleft = xleft + start*dx;
    locxright = xleft + end*dx;

    x      = (float *)malloc((locpoints+2)*sizeof(float));
    temperature = (float **)malloc(2 * sizeof(float *));
    temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
    temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
    current = 0;
    new = 1;

    /* setup initial conditions */

    time = 0.;
    for (i=0; i<locpoints+2; i++) {
        x[i] = locxleft + (i-1)*dx;
        temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
    }
    fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
    fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
    #ifdef ONESIDED
    *leftgc  = fixedlefttemp;
    *rightgc = fixedrighttemp;
    #endif

    /* evolve */
    for (step=0; step < nsteps; step++) {
        /* boundary conditions: keep endpoint temperatures fixed. */

        #ifdef ONESIDED
            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
            MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
            MPI_Win_unlock( left, rightwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
            MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
            MPI_Win_unlock( right, leftwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
            temperature[current][0]           = *leftgc;
            MPI_Win_unlock( rank, leftwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
            temperature[current][locpoints+1] = *rightgc;
            MPI_Win_unlock( rank, rightwin );
        #else
            temperature[current][0] = fixedlefttemp;
            temperature[current][locpoints+1] = fixedrighttemp;

            /* send data rightwards */
            MPI_Sendrecv(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, righttag,
                         &(temperature[current][0]), 1, MPI_FLOAT, left,  righttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* send data leftwards */
            MPI_Sendrecv(&(temperature[current][1]), 1, MPI_FLOAT, left, lefttag,
                         &(temperature[current][locpoints+1]), 1, MPI_FLOAT, right,  lefttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #endif

        for (i=1; i<locpoints+1; i++) {
            temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
                (temperature[current][i+1] - 2.*temperature[current][i] +
                 temperature[current][i-1]) ;
        }

        time += dt;

        if ((rank % 2) == 0)
            usleep(10000u);

        current = new;
        new = 1 - current;
    }

    rms  = 0.;
    for (i=1;i<locpoints+1;i++) {
        rms += (temperature[current][i])*(temperature[current][i]);
    }
    float totrms;
    MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        totrms = sqrt(totrms/totpoints);
        printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
    }


    #ifdef ONESIDED
    MPI_Win_free(&leftwin);
    MPI_Win_free(&rightwin);
    #endif

    free(temperature[1]);
    free(temperature[0]);
    free(temperature);
    free(x);

    MPI_Finalize();
    return 0;
}
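If the MPI implementation supports MPI-3, one way that might reduce the per-step lock/unlock traffic would be to hold a passive-target epoch across the whole loop and complete individual puts with MPI_Win_flush. This is my own untested sketch on top of the program above, not something that was measured here; whether the MPI_Win_sync calls are strictly required depends on the window's memory model:

/* untested MPI-3 sketch: keep a shared lock on both windows for the
   whole run and complete each put with MPI_Win_flush instead of
   re-locking every step */
MPI_Win_lock_all(0, leftwin);
MPI_Win_lock_all(0, rightwin);
for (step = 0; step < nsteps; step++) {
    MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
    MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
    /* complete the puts at the targets; guard against MPI_PROC_NULL
       at the domain edges */
    if (left  != MPI_PROC_NULL) MPI_Win_flush(left,  rightwin);
    if (right != MPI_PROC_NULL) MPI_Win_flush(right, leftwin);

    MPI_Win_sync(leftwin);   /* make remotely written ghost values     */
    MPI_Win_sync(rightwin);  /* visible to the local loads below       */
    temperature[current][0]           = *leftgc;
    temperature[current][locpoints+1] = *rightgc;

    /* ... interior update, usleep, and buffer swap exactly as above ... */
}
MPI_Win_unlock_all(rightwin);
MPI_Win_unlock_all(leftwin);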