我已经开发了一个累积和函数,如下面在Haskell库Repa中定义的那样.但是,在将此函数与转置操作组合时,我遇到了一个问题.以下所有3项操作都在一秒钟内完成:
cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x
Run Code Online (Sandbox Code Playgroud)
但是,如果我写:
cumsum $ transpose x
Run Code Online (Sandbox Code Playgroud)
性能骇人听闻.虽然单独的每个操作在1920x1080图像上都不到一秒钟,但当它们合并时,它们现在需要30秒以上......
关于可能导致这种情况的任何想法?我的直觉告诉我它与延迟阵列有关,而不是在正确的时间强迫等等......但是我还没有足够的经验去追踪它.
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
import Data.Array.Repa as Repa
{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))
{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
where (z :. _) = extent arr
{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)
{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr
Run Code Online (Sandbox Code Playgroud)
Ben*_*ier 25
从库实现者的角度来看,调试它的方法是为可疑操作创建一个包装器,然后查看核心代码以查看fusion是否有效.
-- Main.hs ---------------------------------------------------
import Solver
import Data.Array.Repa.IO.BMP
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP img
-- Solver.hs --------------------------------------------------
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
module Solver (cumsumBMP) where
import Data.Array.Repa as Repa
import Data.Word
{- all your defs -}
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img = cumsum $ transpose img
Run Code Online (Sandbox Code Playgroud)
我将"求解器"代码放在一个单独的模块中,因此我们只需要浏览核心代码以了解我们关心的定义.
编译如下:
touch Solver.hs ; ghc -O2 --make Main.hs \
-ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions > dump
Run Code Online (Sandbox Code Playgroud)
转到定义cumsumBMP并搜索letrec关键字.搜索letrec是查找内部循环的快速方法.
不太远,我看到了这一点:(稍微重新格式化)
case gen_a1tr
of _ {
GenManifest vec_a1tv ->
case sh2_a1tc `cast` ... of _ { :. sh3_a1iu sh4_a1iv ->
case ix'_a1t9 `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
case sh3_a1iu `cast` ... of _ { :. sh5_X1n0 sh6_X1n2 ->
case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
case sh5_X1n0 of _ { :. sh7_X1n8 sh8_X1na ->
...
case sh2'1_X1nb of _ { I# y3_X1nO ->
case sh4_a1iv of _ { I# y4_X1nP ->
case sh2'_a1iA of _ { I# y5_X1nX ->
...
let { x3_a1x6 :: Int# [LclId]
x3_a1x6 =
+#
(*#
(+#
(*#
y1_a1iM
y2_X1nG)
y3_X1nO)
y4_X1nP)
y5_X1nX } in
case >=#
x3_a1x6
0
of ...
Run Code Online (Sandbox Code Playgroud)
灾害!该x3_a1x6绑定显然是做了一些有益的工作(乘法,加法和类似的东西),但它包裹在一个长的系列也是每次循环执行拆箱操作.更糟糕的是,它在每次迭代时都将数组的长度和宽度(形状)拆箱,这些信息将始终相同.GHC应该真的将这些case表达式浮出循环,但它还没有.这是关于GHC trac的问题#4081的一个实例,希望很快就能修复.
解决方法是应用于deepSeqArray传入的数组.这就要求它在顶层(循环外)的价值,这让GHC知道可以进一步提升案例匹配.对于类似的函数cumsumBMP,我们还希望传入的数组已经显示,因此我们可以为此添加一个明确的大小写匹配:
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
= img `deepSeqArray` cumsum $ transpose img
Run Code Online (Sandbox Code Playgroud)
再次编译,内循环现在看起来好多了:
letrec {
$s$wfoldlM'_loop_s2mW [...]
:: Int# -> Word# -> Word# [...]
$s$wfoldlM'_loop_s2mW =
\ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
case <=# sc_s2mA a_s2ji of _ {
False -> sc1_s2mB;
True ->
$s$wfoldlM'_loop_s2mW
(+# sc_s2mA 1)
(narrow8Word#
(plusWord#
sc1_s2mB
(indexWord8Array#
rb3_a2gZ
(+#
rb1_a2gX
(+#
(*#
(+#
(*#
wild19_X1zO
ipv1_X1m5)
sc_s2mA)
ipv2_X1m0)
wild20_X1Ct)))))
}; } in
Run Code Online (Sandbox Code Playgroud)
这是一个紧凑的尾递归循环,只使用原始操作.如果你编译-fllvm -optlo-O3,没有理由不能像同等的C程序一样快.
虽然在运行它时有轻微的打嗝:
desire:tmp benl$ ./Main
Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP
Run Code Online (Sandbox Code Playgroud)
这只是提醒我们,我们需要在调用之前强制执行数组cumsumBMP.
-- Main.hs ---------------------------------------------------
...
import Data.Array.Repa as Repa
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP $ Repa.force img
Run Code Online (Sandbox Code Playgroud)
综上所述:
deepSeqArray在顶级函数中添加一些和模式匹配的goop来解决GHC中当前的不合理问题.cumsumBMP上面函数的最终版本证明了这一点.如果你想让GHC总部尽快解决这个问题,那么就把你自己添加为GHC trac上的问题#4081.修复程序将更加漂亮.indexSlice和朋友联系.一般规则是将goop添加到使用的函数force,fold或sumAll.这些函数实例化对数组数据进行操作的实际循环,即,它们将延迟数组转换为清单值.