有没有办法在Chapel中自定义整个数组语句的默认并行化行为？

Question

有没有办法在Chapel中自定义整个数组语句的默认并行化行为？

根据Chapel的可用文档,(整个)数组语句如

A = B + alpha * C;   // with A, B, and C being arrays, and alpha some scalar

Run Code Online (Sandbox Code Playgroud)

在语言中实现如下的forall迭代:

forall (a,b,c) in zip(A,B,C) do
   a = b + alpha * c;

Run Code Online (Sandbox Code Playgroud)

因此,默认情况下,数组语句由并行线程团队执行.不幸的是,这似乎也完全排除了这些陈述的(部分或完全)矢量化.对于习惯于Fortran或Python/Numpy等语言的程序员而言,这可能会带来性能上的惊喜(默认行为通常是只对数组语句进行矢量化).

对于使用(整数)数组语句与小到中等大小的数组的代码,矢量化的丢失(由Linux硬件性能计数器确认)和并行线程固有的显着开销(不适合有效地利用细粒度数据) - 在这些问题中可用的并行性)可能导致性能的显着损失.作为一个例子,考虑以下版本的Jacobi迭代,它们都解决了300 x 300区域的相同问题:

Jacobi_1 使用数组语句,如下所示:

/*
 *  Jacobi_1
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (whole-)array statements.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  X[n+1,1..n] = 1.0;                 // Set south boundary values to 1.0

  do {

    // compute next approximation
    XNew[1..n,1..n] =
      ( X[0..n-1,1..n] + X[2..n+1,1..n] +
        X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n,1..n] = XNew[1..n,1..n];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Run Code Online (Sandbox Code Playgroud)

Jacobi_2 在整个过程中使用串行for循环(即只允许后端C编译器进行(自动)矢量化):

/*
 *  Jacobi_2
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (serial) for-loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in 1..n do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    for i in 1..n do
      for j in 1..n do  
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Run Code Online (Sandbox Code Playgroud)

Jacobi_3最后,最内层的循环被矢量化,只有最外面的循环线程化:

/*
 *  Jacobi_3
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in vectorizeOnly(1..n) do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Run Code Online (Sandbox Code Playgroud)

在具有2个处理器核心和使用两个并行线程的笔记本电脑上运行这些代码,人们发现它Jacobi_1(令人惊讶地)慢了十倍以上Jacobi_2,其本身(预期)慢了约1.6倍Jacobi_3.

不幸的是,这种默认行为使得数组语句对于我的用例完全没有吸引力,即使对于可以从更简洁的表示法和(整个)数组语句可以提供的可读性中获益的算法也是如此.

Chapel中的用户是否有办法更改此默认行为？也就是说,用户是否可以自定义整个数组语句的默认并行化,使得这样的数组语句的Jacobi_1行为类似于代码Jacobi_2(对代码开发和调试有用)或代码in Jacobi_3(其中三个是生产计算的首选方法)？

我试图通过将调用" vectorizeOnly()"插入上面的"域"定义来实现这一点,但无济于事.

Answer 1

Bra*_*rad 2

Chapel 的目的是在用于实现循环的每任务串行循环中自动支持矢量化forall（对于可合法矢量化的情况）。然而，正如您所指出的那样，这种功能今天并没有得到很好的支持（甚至vectorizeOnly()您正在使用的迭代器也仅被视为原型）。

我要提到的是，使用 Chapel 的 LLVM 后端时，我们往往会看到比使用（默认）C 后端更好的矢量化结果，并且在使用 Simon Moll 的基于 LLVM 的区域矢量化器时，我们看到了更好的结果（萨尔大学）。但我们也看到过 LLVM 后端性能低于 C 后端的情况，因此您的情况可能会有所不同。但如果您关心矢量化，那么值得一试。

对于你的具体问题：

Chapel 中的用户有办法更改此默认行为吗？

有。对于显式forall循环，您可以编写自己的并行迭代器，它可用于指定forall与默认迭代器使用的不同的循环实现策略。如果您实现了您喜欢的一个，则可以编写（或克隆和修改）域映射（此处为背景）来控制默认情况下如何实现给定数组上的循环（即，如果没有显式调用迭代器）。这允许最终用户为 Chapel 阵列指定与我们默认支持的策略不同的实施策略。

关于您的三个代码变体，我注意到第一个使用多维拉链，目前已知它存在严重的性能问题。这可能是它与其他产品之间性能差异的主要原因。例如，我怀疑如果您使用表单重写它forall (i,j) in Domain ...，然后使用每个维度 +/-1 索引，您会看到显着的改进（而且，我猜，性能与第三种情况更具可比性）。

对于第三个，我很好奇您所看到的好处是由于矢量化还是仅仅由于多任务处理，因为您已经避免了第一个的性能问题和第二个的串行实现。例如，您是否检查过使用 vectorizeOnly() 迭代器是否比没有该迭代器的相同代码增加了任何性能改进（或者在二进制文件上使用了工具来检查矢量化是否正在发生？）

在任何 Chapel 性能研究中，请确保抛出--fast编译器标志。同样，为了获得最佳矢量化结果，您可以尝试 LLVM 后端。

归档时间：	7 年，3 月前
查看次数：	123 次
最近记录：	7 年，3 月前