Sal*_*Sal 12 performance haskell vector
我正在使用向量库和状态monad在Haskell中编写一个最长的公共子序列算法(以封装Miller O(NP)算法的非常强制性和可变性).我已经用C语言编写了一些我需要它的项目,现在我正在用Haskell编写它作为一种探索如何编写具有匹配C的良好性能的命令式网格遍历算法的方法.我用unboxed向量编写的版本对于相同的输入,它比C版本慢大约4倍(并且使用正确的优化标志编译 - 我同时使用了系统时钟和Criterion验证Haskell和C版本之间相对时间测量的方法,以及相同的数据类型,包括大输入和小输入).我一直试图弄清楚性能问题可能在哪里,并会欣赏反馈 - 有可能在这里遇到一些众所周知的性能问题,特别是在我在这里大量使用的矢量库中.
在我的代码中,我有一个名为gridWalk的函数,它最常被调用,并且还完成了大部分工作.性能下降很可能会出现,但我无法弄清楚它可能是什么.完整的Haskell代码就在这里.以下代码的片段:
import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when)
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int
type MVI1 s = MVector (PrimState (ST s)) Int
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
where
n = U.length a
m = U.length b
go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
| otherwise = len
-- function to find previous y on diagonal k for furthest point
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
let k0 = k+offset-1
k1 = k+offset+1
y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
y1 <- MU.unsafeRead fp k1
if y0 > y1 then return (k0,y0)
else return (k1,y1)
{-#INLINE findYP #-}
gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
let !offset = 1+U.length a
(!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset
let xp = yp-k
len = {-#SCC cmp #-} cmp a b xp yp
x = xp+len
y = yp+len
{-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y
return ()
{-#INLINE gridWalk #-}
-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}
Run Code Online (Sandbox Code Playgroud)
我添加了一些成本中心注释,并使用某个LCS输入进行分析以进行测试.这是我得到的:
total time = 2.39 secs (2394 ticks @ 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
gridWalk Main 67.5 52.7
findSnakes Main 23.2 27.8
cmp Main 4.2 0.0
findYP Main 3.5 19.4
updateFP Main 1.6 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 64 0 0.0 0.0 100.0 100.0
main Main 129 0 0.0 0.0 0.0 0.0
CAF Main 127 0 0.0 0.0 100.0 100.0
findSnakes Main 141 0 0.0 0.0 0.0 0.0
main Main 128 1 0.0 0.0 100.0 100.0
findSnakes Main 138 0 0.0 0.0 0.0 0.0
gridWalk Main 139 0 0.0 0.0 0.0 0.0
cmp Main 140 0 0.0 0.0 0.0 0.0
while Main 132 4001 0.1 0.0 100.0 100.0
findSnakes Main 133 12000 23.2 27.8 99.9 99.9
gridWalk Main 134 16004000 67.5 52.7 76.7 72.2
cmp Main 137 16004000 4.2 0.0 4.2 0.0
updateFP Main 136 16004000 1.6 0.0 1.6 0.0
findYP Main 135 16004000 3.5 19.4 3.5 19.4
newVI1 Main 130 1 0.0 0.0 0.0 0.0
newVI1.\ Main 131 8004 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 112 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 104 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 102 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 95 0 0.0 0.0 0.0 0.0
Run Code Online (Sandbox Code Playgroud)
如果我正确地解释分析输出(并假设由于分析没有太多的失真),gridWalk大部分时间,但主要功能cmp和findYP繁重的工作gridWalk,似乎在分析报告中花费很少的时间.那么,也许瓶颈forM_在于findSnakes函数用于调用的包装器gridWalk?堆配置文件似乎也很干净:
阅读核心,没有什么能真正跳出来.我认为内部循环中的某些值可能是盒装的,但我没有在核心中发现它们.我希望性能问题是由于我错过了一些简单的事情.
更新
继@DanielFischer的建议,我更换forM_的Data.Vector.Unboxed与的Control.Monad的findSnakes功能,提高了取4至C版2.5倍的性能.Haskell和C版本现在发布在这里,如果你想试用它们.
我仍在挖掘核心,看看瓶颈在哪里.gridWalk最常被称为函数,并且为了表现良好,lcsh应该将whileM_循环减少到条件检查和内联findSnakes代码的良好迭代内循环.我怀疑在汇编中,这不是whileM_循环的情况,但由于我不太了解转换核心,并在汇编中找到名称受损的GHC函数,我想这只是耐心地解决问题直到我明白了.同时,如果有关于性能修复的任何指示,他们将不胜感激.
我能想到的另一种可能性是函数调用期间堆检查的开销.如剖析报告所示,gridWalk被称为16004000次.假设堆检查有6个循环(我猜它少了,但仍然让我们假设),在3.33GHz盒子上,对于96024000个循环,它是~0.02秒.
还有一些性能数字:
Haskell code (GHC 7.6.1 x86_64):forM_修复前约0.25秒.
time ./T
1
real 0m0.150s
user 0m0.145s
sys 0m0.003s
Run Code Online (Sandbox Code Playgroud)
C code (gcc 4.7.2 x86_64):
time ./test
1
real 0m0.065s
user 0m0.063s
sys 0m0.000s
Run Code Online (Sandbox Code Playgroud)
更新2:
更新的代码在这里.使用STUArray不会改变数字.性能约为1.5倍Mac OS X (x86_64,ghc7.6.1),与@DanielFischer在Linux上的报告非常相似.
Haskell代码:
$ time ./Diff
1
real 0m0.087s
user 0m0.084s
sys 0m0.003s
Run Code Online (Sandbox Code Playgroud)
C代码:
$ time ./test
1
real 0m0.056s
user 0m0.053s
sys 0m0.002s
Run Code Online (Sandbox Code Playgroud)
瞥了一眼cmm,这个调用是尾递归的,并且被转换为循环llvm.但是每个新的迭代似乎都会分配新的值来调用堆检查,因此,可以解释性能上的差异.我必须考虑如何以这样的方式编写尾递归,以便在迭代之间不分配值,从而避免堆检查和分配开销.
Dan*_*her 10
你受到了巨大打击
U.forM_ (U.fromList [0..ct-1])
Run Code Online (Sandbox Code Playgroud)
在findSnakes.我确信不应该发生(票?),但Vector每次findSnakes调用时都会分配一个新的遍历.如果你使用
Control.Monad.forM_ [0 .. ct-1]
Run Code Online (Sandbox Code Playgroud)
相反,运行时间大致减半,这里的分配下降了大约500倍.(GHC优化C.M.forM_ [0 :: Int .. limit]好,列表被淘汰,剩下基本上是一个循环.)你可以做轻微通过自己编写循环更好.
导致无端分配/代码大小膨胀而不会损害性能的一些事情是
Bool参数lcshcmp参数findSnakes和gridWalk; 如果从未使用与顶级不同的比较来调用它们cmp,则该参数会导致不必要的代码重复.while; 将它专门用于使用的类型ST s Bool -> ST s () -> ST s ()会减少分配(很多),也减少运行时间(稍微但很明显,这里).关于性能分析的一般说法:编译用于性能分析的程序会抑制许多优化.特别是对于像图书馆vector,bytestring或者text频繁使用的融合,分析常常会产生误导的结果.
例如,您的原始代码在此处生成
total time = 3.42 secs (3415 ticks @ 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.7 52.7 2176 2432608000
findSnakes Main 20.0 27.8 682 1281440080
cmp Main 9.2 0.0 313 16
findYP Main 4.2 19.4 144 896224000
updateFP Main 2.7 0.0 91 0
Run Code Online (Sandbox Code Playgroud)
只需添加上的结合一声巨响len在gridWalk改变什么都在非纹版,但对于分析版本
total time = 2.98 secs (2985 ticks @ 1000 us, 1 processor)
total alloc = 3,204,404,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.0 32.0 1881 1024256000
findSnakes Main 22.2 40.0 663 1281440080
cmp Main 7.2 0.0 214 16
findYP Main 4.7 28.0 140 896224000
updateFP Main 2.7 0.0 82 0
Run Code Online (Sandbox Code Playgroud)
它有很大的不同.对于包含上述更改(以及bang on lenin gridWalk)的版本,分析版本说
total alloc = 1,923,412,776 bytes (excludes profiling overheads)
Run Code Online (Sandbox Code Playgroud)
但非分析版本
1,814,424 bytes allocated in the heap
10,808 bytes copied during GC
49,064 bytes maximum residency (2 sample(s))
25,912 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.12s ( 0.12s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.12s ( 0.12s elapsed)
Run Code Online (Sandbox Code Playgroud)
says it allocated 1000-fold less than the profiling version.
For vector and friends code, more reliable for identifying bottlenecks than profiling (unfortunately also much much more time-consuming and difficult) is studying the generated core (or assembly, if you are proficient in reading that).
Concerning the update, the C runs a little slower on my box (gcc-4.7.2, -O3)
$ time ./miltest1
real 0m0.074s
user 0m0.073s
sys 0m0.001s
Run Code Online (Sandbox Code Playgroud)
but the Haskell about the same
$ time ./hsmiller
1
real 0m0.151s
user 0m0.149s
sys 0m0.001s
Run Code Online (Sandbox Code Playgroud)
That is a little faster when compiling via the LLVM backend:
$ time ./hsmiller1
real 0m0.131s
user 0m0.129s
sys 0m0.001s
Run Code Online (Sandbox Code Playgroud)
And when we replace the forM_ with a manual loop,
findSnakes a b fp !k !ct op = go 0
where
go x
| x < ct = gridWalk a b fp (op k x) >> go (x+1)
| otherwise = return ()
Run Code Online (Sandbox Code Playgroud)
it gets a bit faster,
$ time ./hsmiller
1
real 0m0.124s
user 0m0.121s
sys 0m0.002s
Run Code Online (Sandbox Code Playgroud)
resp. via LLVM:
$ time ./hsmiller
1
real 0m0.108s
user 0m0.107s
sys 0m0.000s
Run Code Online (Sandbox Code Playgroud)
By and large, the generated core looks fine, one small annoyance was
Main.$wa
:: forall s.
GHC.Prim.Int#
-> GHC.Types.Int
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s, Main.MVI1 s #)
Run Code Online (Sandbox Code Playgroud)
and a slightly roundabout implementation. That is fixed by making newVI1 strict in its second argument,
newVI1 n !x = do
Run Code Online (Sandbox Code Playgroud)
Since that isn't called often, the effect on performance is of course negligible.
The meat is the core for lcsh, and that doesn't look too bad. The only boxed things in that are the Ints read from /written to the STRef, and that is inevitable. What's not so pleasant is that the core contains a lot of code duplication, but in my experience, that rarely is a real performance problem, and not all duplicated code survives the code generation.
and for it to perform well,
lcshshould reducewhileM_loop to a nice iterative inner loop of condition check and inlinedfindSnakescode.
You get an inner loop when you add an INLINE pragma to whileM_, but that loop is not nice, and, in this case it is much slower than having the whileM_ out-of-line (I'm not sure whether it's solely due to code size, but it could be).
| 归档时间: |
|
| 查看次数: |
673 次 |
| 最近记录: |