Debugging a performance bottleneck in a longest common subsequence algorithm

Sal*_*Sal 12 performance haskell vector

I am writing a longest common subsequence algorithm in Haskell using the vector library and the state monad (to encapsulate the very imperative, mutable nature of the Miller O(NP) algorithm). I have already written it in C for a project that needed it, and am now writing it in Haskell as a way to explore how to write this kind of imperative grid-walk algorithm with performance that matches C. The version I wrote with unboxed vectors is about 4 times slower than the C version for the same inputs (compiled with the right optimization flags - I used both system clock time and Criterion to validate the relative timing between the Haskell and C versions, with the same data types, for both large and small inputs). I have been trying to figure out where the performance problem might be and would appreciate feedback - it is possible I have run into some well-known performance issue here, especially in the vector library, which I use heavily.

In my code there is one function called gridWalk that is called most often and does most of the work. The performance slowdown is most likely there, but I cannot figure out what it could be. The complete Haskell code is here. Snippets of the code below:

import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when) 
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int


type MVI1 s  = MVector (PrimState (ST s)) Int

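-- length of the matching run (the "snake") starting at index i in a and index j in b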
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
               where
                 n = U.length a
                 m = U.length b
                 go !len !i !j
                   | (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
                   | otherwise = len

-- function to find previous y on diagonal k for furthest point 
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
              let k0 = k+offset-1
                  k1 = k+offset+1
              y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
              y1 <- MU.unsafeRead fp k1
              if y0 > y1 then return (k0,y0)
              else return (k1,y1)
{-#INLINE findYP #-}

gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
   let !offset = 1+U.length a
   (!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset                          
   let xp = yp-k
       len = {-#SCC cmp #-} cmp a b xp yp
       x = xp+len
       y = yp+len

   {-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y  
   return ()
{-#INLINE gridWalk #-}

-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s ->  Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}

I added some cost centre annotations and profiled it with a particular LCS input for testing. Here is what I get:

  total time  =        2.39 secs   (2394 ticks @ 1000 us, 1 processor)
  total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

gridWalk    Main       67.5   52.7
findSnakes  Main       23.2   27.8
cmp         Main        4.2    0.0
findYP      Main        3.5   19.4
updateFP    Main        1.6    0.0


                                                         individual     inherited
COST CENTRE    MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN           MAIN                     64           0    0.0    0.0   100.0  100.0
 main          Main                    129           0    0.0    0.0     0.0    0.0
 CAF           Main                    127           0    0.0    0.0   100.0  100.0
  findSnakes   Main                    141           0    0.0    0.0     0.0    0.0
  main         Main                    128           1    0.0    0.0   100.0  100.0
   findSnakes  Main                    138           0    0.0    0.0     0.0    0.0
    gridWalk   Main                    139           0    0.0    0.0     0.0    0.0
     cmp       Main                    140           0    0.0    0.0     0.0    0.0
   while       Main                    132        4001    0.1    0.0   100.0  100.0
    findSnakes Main                    133       12000   23.2   27.8    99.9   99.9
     gridWalk  Main                    134    16004000   67.5   52.7    76.7   72.2
      cmp      Main                    137    16004000    4.2    0.0     4.2    0.0
      updateFP Main                    136    16004000    1.6    0.0     1.6    0.0
      findYP   Main                    135    16004000    3.5   19.4     3.5   19.4
   newVI1      Main                    130           1    0.0    0.0     0.0    0.0
   newVI1.\   Main                    131        8004    0.0    0.0     0.0    0.0
 CAF           GHC.Conc.Signal         112           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding         104           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   102           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Handle.FD         95           0    0.0    0.0     0.0    0.0

If I am interpreting the profiling output correctly (and assuming there is not too much distortion due to profiling), gridWalk takes most of the time, but the functions cmp and findYP, which do the heavy lifting inside gridWalk, seem to take very little time in the profiling report. So perhaps the bottleneck is in the forM_ wrapper that findSnakes uses to call gridWalk? The heap profile also looks quite clean: (heap profile image)

Reading the core, nothing really jumps out. I thought some values in the inner loops might be boxed, but I don't spot them in the core. I hope the performance problem comes from something simple that I have missed.

Update:

Following @DanielFischer's suggestion, I replaced the forM_ of Data.Vector.Unboxed with the one from Control.Monad in the findSnakes function, which improved the performance from 4x to 2.5x of the C version. The Haskell and C versions are now posted here if you want to try them out.

I am still digging through the core to see where the bottleneck is. gridWalk is the most frequently called function, and for it to perform well, lcsh should reduce the whileM_ loop to a nice iterative inner loop of a condition check and the inlined findSnakes code. I suspect that is not what happens to the whileM_ loop in the assembly, but since I am not very good at translating core and locating name-mangled GHC functions in assembly, I guess it is just a matter of patiently working at the problem until I figure it out. Meanwhile, if there are any pointers on performance fixes, they will be appreciated.

Another possibility I can think of is the overhead of heap checks on function calls. As seen in the profiling report, gridWalk is called 16004000 times. Assuming 6 cycles per heap check (I guess it is less, but let us assume that anyway), on a 3.33GHz box those 96024000 cycles come to roughly ~0.02 secs.

Some more performance numbers:

Haskell code (GHC 7.6.1 x86_64): ~0.25 secs before the forM_ fix.

 time ./T
1

real    0m0.150s
user    0m0.145s
sys     0m0.003s

C code (gcc 4.7.2 x86_64):

time ./test
1

real    0m0.065s
user    0m0.063s
sys     0m0.000s

Update 2:

The updated code is here. Using STUArray does not change the numbers either. The performance is about 1.5x of C on Mac OS X (x86_64, ghc 7.6.1), very similar to what @DanielFischer reported on Linux.

Haskell code:

$ time ./Diff
1

real    0m0.087s
user    0m0.084s
sys 0m0.003s

C code:

$ time ./test
1

real    0m0.056s
user    0m0.053s
sys 0m0.002s

Glancing at the cmm, the call is tail-recursive and is turned into a loop by llvm. But each new iteration seems to allocate new values, which triggers the heap check, and that might explain the difference in performance. I have to think about how to write the tail recursion in such a way that no values are allocated across iterations, thereby avoiding the heap-check and allocation overhead.

Dan*_*her 10

You take a huge hit with

U.forM_ (U.fromList [0..ct-1])

in findSnakes. I am convinced that is not supposed to happen (ticket?), but it allocates a new Vector to traverse every time findSnakes is called. If you use

Control.Monad.forM_ [0 .. ct-1]

instead, the running time roughly halves, and the allocation drops by a factor of about 500 here. (GHC optimises C.M.forM_ [0 :: Int .. limit] well; the list is eliminated and what remains is essentially a loop.) You can do slightly better by writing the loop yourself.
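
For concreteness, here is a sketch of findSnakes with that change, keeping the question's signature and only swapping where forM_ comes from:

import qualified Control.Monad as M

-- Iterate over a plain Int list instead of building a fresh unboxed
-- Vector on every call; GHC eliminates the list, leaving essentially a loop.
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int
           -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int)
           -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op =
    M.forM_ [0 .. ct-1] (\x -> gridWalk a b fp (op k x) cmp)
{-# INLINE findSnakes #-}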

Some things that cause needless allocation and code-size bloat without hurting performance much are

  • the unused Bool argument of lcsh
  • the cmp arguments of findSnakes and gridWalk; if these are never called with a comparison other than the top-level cmp, those arguments only cause needless code duplication.
  • the general type of while; specialising it to the type at which it is used, ST s Bool -> ST s () -> ST s (), reduces allocation (considerably) and also the running time (slightly, but noticeably, here) - see the sketch after this list.
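
A minimal sketch of that specialisation, assuming the while in the linked code is the usual monadic while-loop, just with a polymorphic Monad type:

import Control.Monad (when)
import Control.Monad.ST (ST)

-- Pinning `while` to ST s avoids passing a Monad dictionary around;
-- this is only a sketch of what the specialised definition could look like.
while :: ST s Bool -> ST s () -> ST s ()
while cond body = loop
  where
    loop = do
      c <- cond
      when c (body >> loop)

(No INLINE pragma on it: as noted at the end of this answer for whileM_, inlining the loop combinator can actually make things slower.)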

A general remark on profiling: compiling a program for profiling inhibits many optimisations. Especially for libraries like vector, bytestring or text that rely heavily on fusion, profiling often produces misleading results.

For example, your original code here produces

    total time  =        3.42 secs   (3415 ticks @ 1000 us, 1 processor)
    total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.7   52.7   2176 2432608000
findSnakes  Main       20.0   27.8    682 1281440080
cmp         Main        9.2    0.0    313        16
findYP      Main        4.2   19.4    144 896224000
updateFP    Main        2.7    0.0     91         0

Simply adding a bang on the binding of len in gridWalk changes nothing in the non-profiling version, but for the profiling version

    total time  =        2.98 secs   (2985 ticks @ 1000 us, 1 processor)
    total alloc = 3,204,404,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.0   32.0   1881 1024256000
findSnakes  Main       22.2   40.0    663 1281440080
cmp         Main        7.2    0.0    214        16
findYP      Main        4.7   28.0    140 896224000
updateFP    Main        2.7    0.0     82         0

it makes quite a difference. For the version that includes the changes listed above (and the bang on len in gridWalk), the profiling version says

total alloc = 1,923,412,776 bytes  (excludes profiling overheads)

but the non-profiling version

     1,814,424 bytes allocated in the heap
        10,808 bytes copied during GC
        49,064 bytes maximum residency (2 sample(s))
        25,912 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0         2 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.12s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.12s  (  0.12s elapsed)

says it allocated 1000-fold less than the profiling version.

For vector and friends code, more reliable for identifying bottlenecks than profiling (unfortunately also much much more time-consuming and difficult) is studying the generated core (or assembly, if you are proficient in reading that).
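
For example, one way to get the core and assembly dumped to files, assuming the source file is named hsmiller.hs and compiled directly with GHC:

ghc -O2 -ddump-simpl -ddump-asm -dsuppress-all -ddump-to-file hsmiller.hs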


Concerning the update, the C runs a little slower on my box (gcc-4.7.2, -O3)

$ time ./miltest1

real    0m0.074s
user    0m0.073s
sys     0m0.001s

but the Haskell about the same

$ time ./hsmiller
1

real    0m0.151s
user    0m0.149s
sys     0m0.001s

That is a little faster when compiling via the LLVM backend:

$ time ./hsmiller1

real    0m0.131s
user    0m0.129s
sys     0m0.001s

And when we replace the forM_ with a manual loop,

findSnakes a b fp !k !ct op = go 0
  where
    go x
        | x < ct    = gridWalk a b fp (op k x) >> go (x+1)
        | otherwise = return ()

it gets a bit faster,

$ time ./hsmiller
1

real    0m0.124s
user    0m0.121s
sys     0m0.002s

resp. via LLVM:

$ time ./hsmiller
1

real    0m0.108s
user    0m0.107s
sys     0m0.000s

By and large, the generated core looks fine; one small annoyance was

Main.$wa
  :: forall s.
     GHC.Prim.Int#
     -> GHC.Types.Int
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s, Main.MVI1 s #)

and a slightly roundabout implementation. That is fixed by making newVI1 strict in its second argument,

newVI1 n !x = do

Since that isn't called often, the effect on performance is of course negligible.
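
For context, here is a hypothetical reconstruction of newVI1 (the real definition is in the linked code), under the assumption that it allocates the furthest-point vector of length n with every slot initialised to x:

-- Allocate a mutable Int vector of length n and fill it with x;
-- the bang on x is the strictness fix described above.
newVI1 :: Int -> Int -> ST s (MVI1 s)
newVI1 n !x = do
  v <- MU.new n
  mapM_ (\i -> MU.unsafeWrite v i x) [0 .. n-1]
  return v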

The meat is the core for lcsh, and that doesn't look too bad. The only boxed things in that are the Ints read from /written to the STRef, and that is inevitable. What's not so pleasant is that the core contains a lot of code duplication, but in my experience, that rarely is a real performance problem, and not all duplicated code survives the code generation.

and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code.

You get an inner loop when you add an INLINE pragma to whileM_, but that loop is not nice, and in this case it is much slower than keeping whileM_ out of line (I am not sure whether that is solely due to code size, but it could be).