Mai*_*tor 21 performance haskell loops
在Haskell中有两种显而易见的"惯用"方法来执行嵌套循环:使用列表monad或使用forM_替换传统fors.我已经设置了一个基准来确定它们是否被编译为紧密循环:
import Control.Monad.Loop
import Control.Monad.Primitive
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Unboxed as V
times = 100000
side = 100
-- Using `forM_` to replace traditional fors
test_a mvec =
forM_ [0..times-1] $ \ n -> do
forM_ [0..side-1] $ \ y -> do
forM_ [0..side-1] $ \ x -> do
MV.write mvec (y*side+x) 1
-- Using the list monad to replace traditional forms
test_b mvec = sequence_ $ do
n <- [0..times-1]
y <- [0..side-1]
x <- [0..side-1]
return $ MV.write mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
-- test_a mvec
-- test_b mvec
vec' <- V.unsafeFreeze mvec :: IO (V.Vector Int)
print $ V.sum vec'
Run Code Online (Sandbox Code Playgroud)
此测试创建一个100x100向量,使用嵌套循环向每个索引写入1并重复100k次.编译那些刚ghc -O2 test.hs -o test(GHC版本7.8.4),结果是:3.853s在forM_版本和10.460s针对list monad.为了提供参考,我还用JavaScript编写了这个测试:
var side = 100;
var times = 100000;
var vec = [];
for (var i=0; i<side*side; ++i)
vec.push(0);
for (var n=0; n<times; ++n)
for (var y=0; y<side; ++y)
for (var x=0; x<side; ++x)
vec[x+y*side] = 1;
var s = 0;
for (var i=0; i<side*side; ++i)
s += vec[i];
console.log(s);
Run Code Online (Sandbox Code Playgroud)
这个等效的JavaScript程序需要1s完成,击败Haskell的未装箱的向量,这是不寻常的,这表明Haskell没有在恒定空间中运行循环,而是进行分配.然后我找到了一个声称提供类型保证紧密循环的库Control.Monad.Loop:
-- Using `for` from Control.Monad.Loop
test_c mvec = exec_ $ do
n <- for 0 (< times) (+ 1)
x <- for 0 (< side) (+ 1)
y <- for 0 (< side) (+ 1)
liftIO (MV.write mvec (y*side+x) 1)
Run Code Online (Sandbox Code Playgroud)
哪个运行1s.但是,这个库并不是很常用,而且远非惯用,因此,获得快速恒定空间二维计算的惯用方法是什么?(注意这不是REPA的情况,因为我想在网格上执行任意IO操作.)
And*_*ács 16
Writing tight mutating code with GHC can be tricky sometimes. I'm going to write about a couple of different things, probably in a manner that is more rambling and tl;dr than I would prefer.
For starters, we should use GHC 7.10 in any case, since otherwise the forM_ and list monad solutions never fuse.
Also, I replaced MV.write with MV.unsafeWrite, partly because it's faster, but more importantly it reduces some of the clutter in the resultant Core. From now on runtime statistics refer to code with unsafeWrite.
即使使用GHC 7.10,我们也应该首先注意所有这些[0..times-1]和[0..side-1]表达,因为如果我们不采取必要的步骤,它们每次都会破坏性能.问题是它们是常量范围,并且-ffull-laziness(默认情况下启用-O)会将它们浮动到顶层.这可以防止列表融合,并且迭代一个Int#范围比迭代一个盒装Int-s 列表便宜,所以这是一个非常糟糕的优化.
让我们在几秒钟内看到一些未更改(除了使用unsafeWrite)代码的运行时.ghc -O2 -fllvm使用,我+RTS -s用于计时.
test_a: 1.6
test_b: 6.2
test_c: 0.6
Run Code Online (Sandbox Code Playgroud)
对于我使用的GHC Core查看ghc -O2 -ddump-simpl -dsuppress-all -dno-suppress-type-signatures.
在这种情况下test_a,[0..99]范围被解除:
main4 :: [Int]
main4 = eftInt 0 99 -- means "enumFromTo" for Int.
Run Code Online (Sandbox Code Playgroud)
虽然最外面的[0..9999]循环融合成尾递归帮助器:
letrec {
a3_s7xL :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7xL =
\ (x_X5zl :: Int#) (s1_X4QY :: State# RealWorld) ->
case a2_s7xF 0 s1_X4QY of _ { (# ipv2_a4NA, ipv3_a4NB #) ->
case x_X5zl of wild_X1S {
__DEFAULT -> a3_s7xL (+# wild_X1S 1) ipv2_a4NA;
99999 -> (# ipv2_a4NA, () #)
}
}; }
Run Code Online (Sandbox Code Playgroud)
在这种情况下test_b,再次只有[0..99]被解除.但是,test_b速度要慢得多,因为它必须构建和排序实际[IO ()]列表.至少GHC足够明智,只能[IO ()]为两个内部循环构建单个循环,然后对它进行排序10000.
let {
lvl7_s4M5 :: [IO ()]
lvl7_s4M5 = -- omitted
letrec {
a2_s7Av :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Av =
\ (x_a5xi :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Au
:: [IO ()] -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Au =
\ (ds_a4Nu :: [IO ()]) (eta1_X1c :: State# RealWorld) ->
case ds_a4Nu of _ {
[] ->
case x_a5xi of wild1_X1y {
__DEFAULT -> a2_s7Av (+# wild1_X1y 1) eta1_X1c;
99999 -> (# eta1_X1c, () #)
};
: y_a4Nz ys_a4NA ->
case (y_a4Nz `cast` ...) eta1_X1c
of _ { (# ipv2_a4Nf, ipv3_a4Ng #) ->
a3_s7Au ys_a4NA ipv2_a4Nf
}
}; } in
a3_s7Au lvl7_s4M5 eta_B1; } in
-- omitted
Run Code Online (Sandbox Code Playgroud)
我们该如何解决这个问题?我们可以解决这个问题{-# OPTIONS_GHC -fno-full-laziness #-}.在我们的案例中,这确实有很大帮助:
test_a: 0.5
test_b: 0.48
test_c: 0.5
Run Code Online (Sandbox Code Playgroud)
或者,我们可以摆弄INLINEpragma.在浮动完成后显然内联函数可以保持良好的性能.我发现即使没有编译指示,GHC也会内联我们的测试函数,但是显式编译指示会导致它仅在浮动后才能内联.例如,如果没有-fno-full-laziness:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE test_a #-}
Run Code Online (Sandbox Code Playgroud)
但过早地内联会导致性能不佳:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE [~2] test_a #-} -- "inline before the first phase please"
Run Code Online (Sandbox Code Playgroud)
这个INLINE解决方案的问题在于,面对GHC的浮动冲击,它是相当脆弱的.例如,手动内联不会保留性能.下面的代码很慢,因为INLINE [~2]它给GHC提供了一个浮出水面的机会:
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
Run Code Online (Sandbox Code Playgroud)
那我们该怎么办?
首先,我认为-fno-full-laziness对于那些想要编写高性能代码并且知道自己在做什么的人来说,使用它是一个完全可行的,甚至是更好的选择.例如,它用于unordered-containers.有了它,我们可以更精确地控制共享,我们可以随时手动浮出或内联.
对于更常规的代码,我认为使用Control.Monad.Loop或提供该功能的任何其他包没有任何问题.许多Haskell用户对依赖小型"边缘"库并不是一丝不苟.我们也可以for在期望的普遍性中重新实现.例如,以下表现与其他解决方案一样好:
for :: Monad m => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()
for init while step body = go init where
go !i | while i = body i >> go (step i)
go i = return ()
{-# INLINE for #-}
Run Code Online (Sandbox Code Playgroud)
我最初+RTS -s对堆分配的数据感到非常困惑.test_a分配非平凡的-fno-full-laziness,也test_c 没有完全懒惰,这些分配与times迭代次数成线性比例,但test_b仅为矢量分配完全懒惰:
-- with -fno-full-laziness, no INLINE pragmas
test_a: 242,521,008 bytes
test_b: 121,008 bytes
test_c: 121,008 bytes -- but 240,120,984 with full laziness!
Run Code Online (Sandbox Code Playgroud)
此外,在这种情况下,INLINEpragma test_c对此毫无帮助.
I spent some time trying to find signs of heap allocation in the Core for the relevant programs, without success, until the realization struck me: GHC stack frames are on the heap, including the frames of the main thread, and the functions that were doing heap allocation were essentially running the thrice-nested loops in at most three stack frames. The heap allocation registered by +RTS -s is just the constant popping and pushing of stack frames.
This is pretty much apparent from the Core for the following code:
{-# OPTIONS_GHC -fno-full-laziness #-}
-- ...
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_a mvec
Run Code Online (Sandbox Code Playgroud)
Which I'm including here in its glory. Feel free to skip.
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5HK :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vr { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vr) of _ {
False ->
case newByteArray# 80000 (s_a5HK `cast` ...)
of _ { (# ipv_a5fv, ipv1_a5fw #) ->
letrec {
$s$wa_s8jS
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8jS =
\ (sc_s8jO :: Int#)
(sc1_s8jP :: Int#)
(sc2_s8jR :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jP 10000) of _ {
False -> (# sc2_s8jR, I# sc_s8jO #);
True ->
case writeIntArray# ipv1_a5fw sc_s8jO 0 (sc2_s8jR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
$s$wa_s8jS (+# sc_s8jO 1) (+# sc1_s8jP 1) (s'#_a5Gn `cast` ...)
}
}; } in
case $s$wa_s8jS 0 0 (ipv_a5fv `cast` ...)
-- end of vector creation -------------------
of _ { (# ipv6_a4Hv, ipv7_a4Hw #) ->
letrec {
a2_s7MJ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7MJ =
\ (x_a5Ho :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7ME :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7ME =
\ (x1_X5Id :: Int#) (eta1_XR :: State# RealWorld) ->
case ipv7_a4Hw of _ { I# dt4_a5x6 ->
case writeIntArray#
(ipv1_a5fw `cast` ...) (*# x1_X5Id 100) 1 (eta1_XR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
letrec {
a4_s7Mz :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7Mz =
\ (x2_X5J8 :: Int#) (eta2_X1U :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5fw `cast` ...)
(+# (*# x1_X5Id 100) x2_X5J8)
1
(eta2_X1U `cast` ...)
of s'#1_X5Hf { __DEFAULT ->
case x2_X5J8 of wild_X2o {
__DEFAULT -> a4_s7Mz (+# wild_X2o 1) (s'#1_X5Hf `cast` ...);
99 -> (# s'#1_X5Hf `cast` ..., () #)
}
}; } in
case a4_s7Mz 1 (s'#_a5Gn `cast` ...)
of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x1_X5Id of wild_X1e {
__DEFAULT -> a3_s7ME (+# wild_X1e 1) ipv2_a4QH;
99 -> (# ipv2_a4QH, () #)
}
}
}
}; } in
case a3_s7ME 0 eta_B1 of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x_a5Ho of wild_X1a {
__DEFAULT -> a2_s7MJ (+# wild_X1a 1) ipv2_a4QH;
99999 -> (# ipv2_a4QH, () #)
}
}; } in
a2_s7MJ 0 (ipv6_a4Hv `cast` ...)
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wm, ww6_a5wn #) ->
: ww5_a5wm ww6_a5wn
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
Run Code Online (Sandbox Code Playgroud)
We can also nicely demonstrate the allocation of frames the following way. Let's change test_a:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-50] $ \ x -> -- change here
MV.unsafeWrite mvec (y*side+x) 1
Run Code Online (Sandbox Code Playgroud)
Now the heap allocation stays exactly the same, because the innermost loop is tail-recursive and uses a single frame. With the following change, the heap allocation halves (to 124,921,008 bytes), because we push and pop half as many frames:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-50] $ \ y -> -- change here
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
Run Code Online (Sandbox Code Playgroud)
test_b and test_c (with no full laziness) instead compile to code that uses a nested case construct inside a single stack frame, and walks over the indices to see which one should be incremented. See the Core for the following main:
{-# LANGUAGE BangPatterns #-} -- later I'll talk about this
{-# OPTIONS_GHC -fno-full-laziness #-}
main = do
let vec = V.generate (side*side) (const 0)
!mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_c mvec
Run Code Online (Sandbox Code Playgroud)
Voila:
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5Iw :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vT { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vT) of _ {
False ->
case newByteArray# 80000 (s_a5Iw `cast` ...)
of _ { (# ipv_a5g3, ipv1_a5g4 #) ->
letrec {
$s$wa_s8ji
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8ji =
\ (sc_s8je :: Int#)
(sc1_s8jf :: Int#)
(sc2_s8jh :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jf 10000) of _ {
False -> (# sc2_s8jh, I# sc_s8je #);
True ->
case writeIntArray# ipv1_a5g4 sc_s8je 0 (sc2_s8jh `cast` ...)
of s'#_a5GP { __DEFAULT ->
$s$wa_s8ji (+# sc_s8je 1) (+# sc1_s8jf 1) (s'#_a5GP `cast` ...)
}
}; } in
case $s$wa_s8ji 0 0 (ipv_a5g3 `cast` ...)
of _ { (# ipv6_a4MX, ipv7_a4MY #) ->
case ipv7_a4MY of _ { I# dt4_a5xy ->
-- end of vector creation
letrec {
a2_s7Q6 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Q6 =
\ (x_a5HT :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Q5 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Q5 =
\ (x1_X5J9 :: Int#) (eta1_XP :: State# RealWorld) ->
letrec {
a4_s7MZ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7MZ =
\ (x2_X5Jl :: Int#) (s1_X4Xb :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5g4 `cast` ...)
(+# (*# x1_X5J9 100) x2_X5Jl)
1
(s1_X4Xb `cast` ...)
of s'#_a5GP { __DEFAULT ->
-- the interesting part! ------------------
case x2_X5Jl of wild_X1y {
__DEFAULT -> a4_s7MZ (+# wild_X1y 1) (s'#_a5GP `cast` ...);
99 ->
case x1_X5J9 of wild1_X1o {
__DEFAULT -> a3_s7Q5 (+# wild1_X1o 1) (s'#_a5GP `cast` ...);
99 ->
case x_a5HT of wild2_X1c {
__DEFAULT -> a2_s7Q6 (+# wild2_X1c 1) (s'#_a5GP `cast` ...);
99999 -> (# s'#_a5GP `cast` ..., () #)
}
}
}
}; } in
a4_s7MZ 0 eta1_XP; } in
a3_s7Q5 0 eta_B1; } in
a2_s7Q6 0 (ipv6_a4MX `cast` ...)
}
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wO, ww6_a5wP #) ->
: ww5_a5wO ww6_a5wP
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
Run Code Online (Sandbox Code Playgroud)
I have to admit that I basically don't know why some code avoids stack frame creation and some doesn't. I suspect that inlining from "the inside" out helps, and a quick inspection informed me that Control.Monad.Loop uses a CPS encoding, which might be relevant here, although the Monad.Loop solution is sensitive to let floating, and I couldn't determine on short notice from the Core why test_c with let floating fails to run in a single stack frame.
Now, the performance benefit of running in a single stack frame is small. We've seen that test_b is only slightly faster than test_a. I include this detour in the answer because I found it edifying.
The so-called state hack makes GHC aggressive in inlining into IO and ST actions. I think I should mention it here, because besides let floating this is the other thing that can thoroughly ruin performance.
The state hack is enabled with optimizations -O, and can possibly slow down programs asymptotically. A simple example from Reid Barton:
import Control.Monad
import Debug.Trace
expensive :: String -> String
expensive x = trace "$$$" x
main :: IO ()
main = do
str <- fmap expensive getLine
replicateM_ 3 $ print str
Run Code Online (Sandbox Code Playgroud)
With GHC-7.10.2, this prints "$$$" once without optimizations but three times with -O2. And it seems that with GHC-7.10, we can't get rid of this behavior with -fno-state-hack (which is the subject of the linked ticket from Reid Barton).
Strict monadic bindings reliably get rid of this problem:
main :: IO ()
main = do
!str <- fmap expensive getLine
replicateM_ 3 $ print str
Run Code Online (Sandbox Code Playgroud)
I think it's good habit to do strict bindings in IO and ST. And I have some experience (not definitive though; I'm far from being a GHC expert) that strict bindings are especially needed if we use -fno-full-laziness. Apparently full laziness can help get rid of some of the work duplication introduced by the inlining caused by the state hack; with test_b and no full laziness, omitting the strict binding on !mvec <- V.unsafeThaw vec caused a slight slowdown and extremely ugly Core output.
| 归档时间: |
|
| 查看次数: |
665 次 |
| 最近记录: |