使用Repa修复"循环二维数组"的最佳方法

Question

使用Repa修复"循环二维数组"的最佳方法

iml*_*ris 30 arrays performance profiling haskell repa

我发现数组库Repa for Haskell非常有趣,并且想要制作一个简单的程序,试图了解如何使用它.我还使用列表进行了简单的实现,事实证明它更快.我的主要问题是如何改进下面的Repa代码以使其最有效(并且希望也非常易读).我是一个很新的使用Haskell,我找不到任何容易理解的关于Repa的教程[ 在Haskell Wiki上编辑一个,我在写这篇文章的时候有点忘记],所以不要以为我什么都知道.:)例如,我不确定何时使用force或deepSeqArray.

该程序用于以下列方式近似计算球体的体积:

指定球体的中心点和半径,以及长方体内等距离的坐标,已知这些坐标包围球体.
程序获取每个坐标,计算到球体中心的距离,如果它小于球体的半径,则用于累加球体的总(近似)体积.

下面显示了两个版本,一个使用列表,另一个使用repa.我知道代码是低效的,特别是对于这个用例,但是想法是让它在以后变得更复杂.

对于下面的值,并使用"ghc -Odph -fllvm -fforce-recomp -rtsopts -threaded"进行编译,列表版本需要1.4秒,而修复版本需要12秒,+ RTS -N1和10秒+ + RTS - N2,虽然没有转换火花(我有一台双核Intel机器(Core 2 Duo E7400 @ 2.8 GHz)运行Windows 7 64,GHC 7.0.2和llvm 2.8).(注释下面主要的正确行,只运行其中一个版本.)

感谢您的任何帮助!

import Data.Array.Repa as R
import qualified Data.Vector.Unboxed as V
import Prelude as P

-- Calculate the volume of a sphere by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.


particles = [(0,0,0)] -- used for the list alternative --[(0,0,0),(0,2,0)]
particles_repa = [0,0,0::Double] -- used for the repa alternative, can currently just be one coordinate

-- Radius of the particle
a = 4

-- Generate the coordinates. Could this be done more efficiently, and at the same time simple? In Matlab I would use ndgrid.
step = 0.1 --0.05
xrange = [-10,-10+step..10] :: [Double]
yrange = [-10,-10+step..10]
zrange = [-10,-10+step..10]

-- All coordinates as triples. These are used directly in the list version below.
coords = [(x,y,z)  | x <- xrange, y <- yrange, z <- zrange]

---- List code ----

volumeIndividuals = fromIntegral (length particles) * 4*pi*a**3/3

volumeInside = step**3 * fromIntegral (numberInsideParticles particles coords)

numberInsideParticles particles coords = length $ filter (==True) $ P.map (insideParticles particles) coords

insideParticles particles coord =  any (==True) $ P.map (insideParticle coord) particles

insideParticle (xc,yc,zc) (xp,yp,zp) = ((xc-xp)^2+(yc-yp)^2+(zc-zp)^2) < a**2
---- End list code ----

---- Repa code ----

-- Put the coordinates in a Nx3 array.
xcoords = P.map (\(x,_,_) -> x) coords
ycoords = P.map (\(_,y,_) -> y) coords
zcoords = P.map (\(_,_,z) -> z) coords

-- Total number of coordinates
num_coords = (length xcoords) ::Int

xcoords_r = fromList (Z :. num_coords :. (1::Int)) xcoords
ycoords_r = fromList (Z :. num_coords :. (1::Int)) ycoords
zcoords_r = fromList (Z :. num_coords :. (1::Int)) zcoords

rcoords = xcoords_r R.++ ycoords_r R.++ zcoords_r

-- Put the particle coordinates in an array, then extend (replicate) this array so that its size becomes the same as that of rcoords
particle = fromList (Z :. (1::Int) :. (3::Int)) particles_repa
particle_slice = slice particle (Z :. (0::Int) :. All)
particle_extended = extend (Z :. num_coords :. All) particle_slice

-- Calculate the squared difference between the (x,y,z) coordinates of the particle and the coordinates of the cuboid.
squared_diff = deepSeqArrays [rcoords,particle_extended] ((force2 rcoords) -^ (force2 particle_extended)) **^ 2
(**^) arr pow = R.map (**pow) arr

xslice = slice squared_diff (Z :. All :. (0::Int))
yslice = slice squared_diff (Z :. All :. (1::Int))
zslice = slice squared_diff (Z :. All :. (2::Int))

-- Calculate the distance between each coordinate and the particle center
sum_squared_diff = [xslice,yslice,zslice] `deepSeqArrays` xslice +^ yslice +^ zslice

-- Do the rest using vector, since I didn't get the repa variant working.
ssd_vec = toVector sum_squared_diff

-- Determine the number of the coordinates that are within the particle (instead of taking the square root to get the distances above, I compare to the square of the radius here, to improve performance)
total_within = fromIntegral (V.length $ V.filter (<a**2) ssd_vec)
--total_within = foldAll (\x acc -> if x < a**2 then acc+1 else acc) 0 sum_squared_diff

-- Finally, calculate an approximation of the volume of the sphere by taking the volume of the cubes with side step, multiplied with the number of coordinates within the sphere.
volumeInside_repa = step**3 * total_within 

-- Helper function that shows the size of a 2-D array.
rsize = reverse . listOfShape . (extent :: Array DIM2 Double -> DIM2)

---- End repa code ----

-- Comment out the list or the repa version if you want to time the calculations separately.
main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volumeIndividuals
    putStrLn $ "Volume of cubes inside particles (list) = " P.++ show volumeInside
    putStrLn $ "Volume of cubes inside particles (repa) = " P.++ show volumeInside_repa

Run Code Online (Sandbox Code Playgroud)

编辑:一些背景解释了为什么我编写了上面的代码:

我主要在Matlab中编写代码,而我的性能改进经验主要来自该领域.在Matlab中,您通常希望使用直接在矩阵上运行的函数进行计算,以提高性能.我在Matlab R2010b中实现上述问题,使用下面显示的矩阵版本需要0.9秒,使用嵌套循环需要15秒.虽然我知道Haskell与Matlab有很大不同,但我希望从使用列表到在Haskell中使用Repa数组可以提高代码的性能.来自lists-> Repa arrays-> vectors的转换是因为我不够熟练,无法用更好的东西替换它们.这就是我要求输入的原因.:)上面的时间数因此是主观的,因为它可能比语言的能力更能衡量我的表现,但它现在对我来说是一个有效的指标,因为决定我将使用什么取决于我是否可以制作它的工作与否.

tl;博士:据我所知,我上面的维修代码可能是愚蠢的或病态的,但这是我现在能做的最好的.我希望能够编写更好的Haskell代码,我希望你可以帮助我朝这个方向努力(dons已经做过).:)

function archimedes_simple()

particles = [0 0 0]';
a = 4;

step = 0.1;

xrange = [-10:step:10];
yrange = [-10:step:10];
zrange = [-10:step:10];

[X,Y,Z] = ndgrid(xrange,yrange,zrange);
dists2 = bsxfun(@minus,X,particles(1)).^2+ ...
    bsxfun(@minus,Y,particles(2)).^2+ ...
    bsxfun(@minus,Z,particles(3)).^2;
inside = dists2 < a^2;
num_inside = sum(inside(:));

disp('');
disp(['Step = ' num2str(step)]);
disp(['Volume of individual particles = ' num2str(size(particles,2)*4*pi*a^3/3)]);
disp(['Volume of cubes inside particles = ' num2str(step^3*num_inside)]);

end

Run Code Online (Sandbox Code Playgroud)

编辑2:维修代码的新版本,更快版本和更简单版本

我现在已经阅读了更多关于Repa的内容,并想了一下.以下是一个新的版本.在这种情况下,我使用Repa扩展函数从值列表中创建x,y和z坐标作为3-D数组(类似于ndgrid在Matlab中的工作方式).然后我映射这些数组以计算到球形粒子的距离.最后,我折叠得到的三维距离数组,计算球体内有多少个坐标,然后乘以一个常数因子得到近似体积.我的算法实现现在与上面的Matlab版本更相似,并且不再有任何向量转换.

新版本在我的计算机上运行大约5秒钟,从上面有了相当大的改进.如果我在编译时使用"线程",或者与"+ RTS -N2"结合使用时,时间是相同的,但是线程版本最大化了我的计算机的两个核心.但是,我确实看到几滴"-N2"运行到3.1秒,但后来无法再现它们.也许它对同时运行的其他进程非常敏感？我在基准测试时关闭了计算机上的大多数程序,但仍然有一些程序正在运行,例如后台进程.

如果我们使用"-N2"并添加运行时开关以关闭并行GC(-qg),则时间一直下降到~4.1秒,并使用-qa"使用OS来设置线程关联(实验)",时间减少到约3.5秒.查看使用"+ RTS -s"运行程序的输出,使用-qg执行更少的GC.

今天下午我将看看我是否可以在8核计算机上运行代码,只是为了好玩.:)

import Data.Array.Repa as R
import Prelude as P
import qualified Data.List as L

-- Calculate the volume of a spherical particle by putting it in a bath of coordinates.     Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is     inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to     find an approximate volume.

particles :: [(Double,Double,Double)]
particles = [(0,0,0)]

-- Radius of the spherical particle
a = 4

volume_individuals = fromIntegral (length particles) * 4*pi*a^3/3

-- Generate the coordinates. 
step = 0.1
coords_list = [-10,-10+step..10] :: [Double]
num_coords = (length coords_list) :: Int

coords :: Array DIM1 Double
coords = fromList (Z :. (num_coords ::Int)) coords_list

coords_slice :: Array DIM1 Double
coords_slice = slice coords (Z :. All)

-- x, y and z are 3-D arrays, where the same index into each array can be used to find a     single coordinate, e.g. (x(i,j,k),y(i,j,k),z(i,j,k)).
x,y,z :: Array DIM3 Double
x = extend (Z :. All :. num_coords :. num_coords) coords_slice
y = extend (Z :. num_coords :. All :. num_coords) coords_slice
z = extend (Z :. num_coords :. num_coords :. All) coords_slice

-- Calculate the squared distance from each coordinate to the center of the spherical     particle.
dist2 :: (Double, Double, Double) -> Array DIM3 Double
dist2 particle = ((R.map (squared_diff xp) x) + (R.map (squared_diff yp) y) + (R.map (    squared_diff zp) z)) 
    where
        (xp,yp,zp) = particle
        squared_diff xi xa = (xa-xi)^2

-- Count how many of the coordinates are within the spherical particle.
num_inside_particle :: (Double,Double,Double) -> Double
num_inside_particle particle = foldAll (\acc x -> if x<a^2 then acc+1 else acc) 0 (force     $ dist2 particle)

-- Calculate the approximate volume covered by the spherical particle.
volume_inside :: [Double]
volume_inside = P.map ((*step^3) . num_inside_particle) particles

main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volume_individuals
    putStrLn $ "Volume of cubes inside each particle (repa) = " P.++ (P.concat . (    L.intersperse ", ") . P.map show) volume_inside

-- As an alternative, y and z could be generated from x, but this was slightly slower in     my tests (~0.4 s).
--y = permute_dims_3D x
--z = permute_dims_3D y

-- Permute the dimensions in a 3-D array, (forward, cyclically)
permute_dims_3D a = backpermute (swap e) swap a
    where
        e = extent a
        swap (Z :. i:. j :. k) = Z :. k :. i :. j

Run Code Online (Sandbox Code Playgroud)

新代码的空间分析

与Don Stewart相同类型的配置文件如下所示,但是对于新的Repa代码.

Answer 1

Don*_*art 64

代码审查说明

47.8%的时间花在GC上.
1.5G在堆上分配(!)
该repA的代码看起来很多比单码更加复杂.
发生了大量的并行GC
我可以在-N4机器上获得高达300%的效率
放入更多类型的签名将使分析更容易......
rsize 没用过(看起来很贵!)
你将修复数组转换为向量,为什么？
您的所有用途(**)可以由更便宜的替代(^)上Int.
有可疑数量的大型常量列表.这些都必须转换为数组 - 这似乎很昂贵.
any (==True) 是相同的 or

时间分析

COST CENTRE                    MODULE               %time %alloc

squared_diff                   Main                  25.0   27.3
insideParticle                 Main                  13.8   15.3
sum_squared_diff               Main                   9.8    5.6
rcoords                        Main                   7.4    5.6
particle_extended              Main                   6.8    9.0
particle_slice                 Main                   5.0    7.6
insideParticles                Main                   5.0    4.4
yslice                         Main                   3.6    3.0
xslice                         Main                   3.0    3.0
ssd_vec                        Main                   2.8    2.1
**^                            Main                   2.6    1.4

Run Code Online (Sandbox Code Playgroud)

表明,你的功能squared_diff有点可疑:

squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
                    ((force2 rcoords) -^ (force2 particle_extended)) **^ 2

Run Code Online (Sandbox Code Playgroud)

虽然我没有看到任何明显的修复.

空间剖析

在空间配置文件中没有什么太令人惊奇:您清楚地看到列表阶段,然后是矢量阶段.列表阶段分配了很多,并被回收.

在此输入图像描述

按类型分解堆,我们最初看到很多列表和元组被分配(按需),然后分配并保存了一大块数组:

在此输入图像描述

再次,有点像我们期望看到的......数组的东西不是比列表代码分配更多(事实上,总体上要少一点),但它只需要花费更长的时间来运行.

使用保留器分析检查空间泄漏:

在此输入图像描述

那里有一些有趣的东西,但没什么好吃的.zcoords在列表程序执行的长度内保留,然后为修复运行分配一些数组(SYSTEM).

检查核心

所以在这一点上我首先假设你确实在列表和数组中实现了相同的算法(即在数组的情况下没有进行额外的工作),并且没有明显的空间泄漏.所以我怀疑是非常优化的修复代码.让我们来看看核心(使用ghc-core.

基于列表的代码看起来很好.
数组代码看起来很合理(即出现未装箱的基元),但非常复杂,而且很多.

内联所有CAF

我为所有顶级数组定义添加了内联编译指示,希望删除一些CAFS,并让GHC更难以优化数组代码.这确实使GHC难以编译模块(在处理模块时分配高达4.3G和10分钟).这对我来说是一个线索,GHC以前不能很好地优化这个程序,因为当我增加阈值时,它会有新的东西.

操作

使用-H可以减少在GC中花费的时间.
尝试消除从列表到repas到向量的转换.
所有这些CAF(顶级常量数据结构)都有点奇怪 - 真正的程序不会是顶级常量列表 - 实际上,这个模块在病理学上是这样的,导致很多值在很长一段时间内保留,而不是被优化掉.向内浮动本地定义.
向Repa的作者Ben Lippmeier寻求帮助,特别是因为有一些时髦的优化问题正在发生.

Answer 2

scl*_*clv 7

我改变了代码,迫使rcoords和particle_extended,和disovered我们直接输在其中的大部分份额的时间:

COST CENTRE                    MODULE               %time %alloc

rcoords                        Main                  32.6   34.4
particle_extended              Main                  21.5   27.2
**^                            Main                   9.8   12.7

Run Code Online (Sandbox Code Playgroud)

对此代码的最大单一改进显然是以更好的方式生成这两个常量输入.

请注意,这基本上是一个懒惰的流式算法,而且你要浪费时间的是一次性分配至少两个24361803元素数组的沉没成本,然后可能至少分配一次或两次或放弃共享.我认为,对于这个代码来说,最好的情况是使用非常好的优化器和无数重写规则,将大致匹配列表版本(也可以非常容易地并行化).

我认为Ben和他的合适.我会对这个基准测试感兴趣,但我绝对怀疑这对于一个严格的数组库来说不是一个好的用例,我怀疑matlab隐藏了一些聪明的优化ngrid功能(优化,我会授予它,它移植到修理可能有用.)

编辑:

这是一种快速而又脏的方法来并行化列表代码.导入Control.Parallel.Strategies然后写numberInsideParticles为:

numberInsideParticles particles coords = length $ filter id $ 
    withStrategy (parListChunk 2000 rseq) $ P.map (insideParticles particles) coords

Run Code Online (Sandbox Code Playgroud)

这显示了良好的加速,因为我们扩展核心(一个核心12s到8s时3.7s),但火花创建的开销意味着即使是8个核心,我们也只匹配单核心非并行版本.我尝试了一些替代策略并得到了类似的结果.同样,我不确定我们可以做多少比这里的单线程列表版本更好.由于每个粒子的计算都很便宜,我们主要强调的是分配,而不是计算.我想象的这样一个大赢家将是矢量化计算,而且据我所知,它几乎需要手工编码.

另请注意,并行版本大约70%的时间花在GC上,而单核版本花费1%的时间在那里(即分配尽可能有效地融合掉).

Answer 3

Ben*_*ier 7

我已经添加了一些关于如何优化Haskell wiki的Repa程序的建议:http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Repa_Tutorial#Optimising_Repa_programs

归档时间：	14 年，7 月前
查看次数：	2136 次
最近记录：	14 年，6 月前