Excessive garbage collection (and memory usage?)

Question


cro*_*eea 5 heap monads profiling haskell memory-leaks

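I've identified a small part of a library that appears to contain a memory leak. The code below is as small as I could make it while still producing the same results as the real code.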

import System.Random
import Control.Monad (liftM, liftM2, replicateM)
import Control.Monad.State
import Control.Monad.Loops
import Control.DeepSeq
import Data.Int (Int64)
import qualified Data.Vector.Unboxed as U

vecLen = 2048

main = flip evalStateT (mkStdGen 13) $ do
    let k = 64
    cs <- replicateM k transform
    let sizeCs = k*2*7*vecLen*8 -- 64 samples, 2 elts per list, each of len 7*vecLen, 8 bytes per Int64
    (force cs) `seq` lift $ putStr $ "Expected to use ~ " ++ (show ((fromIntegral sizeCs) / 1000000 :: Double)) ++ " MB of memory\n"

transform :: (Monad m, RandomGen g)
           => StateT g m [U.Vector Int64]
transform = do
      e <- liftM ((U.map round) . (uncurry (U.++)) . U.unzip) $ U.replicateM (vecLen `div` 2) sample
      c1 <- U.replicateM (7*vecLen) $ state random
      return [U.concat $ replicate 7 e, c1]

sample :: (RandomGen g, Monad m) => StateT g m (Double, Double)
sample = do 
    let genUVs = liftM2 (,) (state $ randomR (-1,1)) (state $ randomR (-1,1))
        -- memory usage drops and productivity increases to about 58% if I set the guard to "False" (the real code needs a guard here)
        uvGuard (u,v) = u+v >= 2 -- False -- 
    (u,v) <- iterateWhile uvGuard genUVs
    return (u, v)


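Removing any more of the code significantly improves performance in memory usage/GC, in time, or in both. However, I need to compute what the code above computes, so the real code can't be any simpler. For example, if I make both e and c1 take their values from sample, the code uses 27 MB of memory and spends 9% of its runtime in GC. If I make both e and c1 use state random, I use about 400 MB of memory and spend only 32% of the runtime in GC.

The main parameter is vecLen, which I really need to be around 8192. To speed up profiling, I generated all of the results below with vecLen=2048, but the problem gets worse as vecLen increases.

Compiling with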

ghc test -rtsopts


I get:

> ./test +RTS -sstderr
Working...
Expected to use ~ 14.680064 MB of memory
Done
   3,961,219,208 bytes allocated in the heap
   2,409,953,720 bytes copied during GC
     383,698,504 bytes maximum residency (17 sample(s))
       3,214,456 bytes maximum slop
             869 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7002 colls,     0 par    1.33s    1.32s     0.0002s    0.0034s
  Gen  1        17 colls,     0 par    1.60s    1.84s     0.1080s    0.5426s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.08s  (  2.12s elapsed)
  GC      time    2.93s  (  3.16s elapsed)
  EXIT    time    0.00s  (  0.03s elapsed)
  Total   time    5.01s  (  5.30s elapsed)

  %GC     time      58.5%  (59.5% elapsed)

  Alloc rate    1,904,312,376 bytes per MUT second

  Productivity  41.5% of total user, 39.2% of total elapsed


real    0m5.306s
user    0m5.008s
sys 0m0.252s


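Profiling with -p or -h* doesn't reveal much, at least to me.

ThreadScope, however, is interesting: [image: threadscope]

It looks to me like I'm blowing the heap, so a GC happens and the heap size doubles. Indeed, when I run with -H4000M, the ThreadScope output looks slightly more even (less of the double-the-work, double-the-GC pattern), but I still spend about 60% of the overall runtime doing GC. Compiling with -O2 is even worse, with over 70% of the runtime spent in GC.

Questions: 1. Why is the GC running so much? 2. Is my heap usage unexpectedly large? If so, why?

For question 2, I realize that heap usage may exceed my "expected" memory usage, even by a lot. But 800 MB seems excessive to me. (Is that even the number I should be looking at?)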
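For reference, the "expected" figure follows from the arithmetic in the comment in main. The sketch below reproduces that calculation and compares it with the maximum residency reported by -sstderr; the names expectedBytes and overhead are hypothetical helpers, not part of the original code. With vecLen = 2048 the payload is 14,680,064 bytes, which matches the printed 14.680064 MB, so the reported maximum residency is roughly 26 times the payload.

```haskell
module Main (main) where

-- Bytes needed for the fully evaluated result, following the comment in
-- the question's main: k samples, 2 vectors per sample, 7 * vecLen
-- elements per vector, 8 bytes per Int64.
expectedBytes :: Int -> Int -> Int
expectedBytes k vecLen = k * 2 * 7 * vecLen * 8

-- Ratio of a measured byte count to the expected payload.
overhead :: Int -> Int -> Double
overhead measured expected = fromIntegral measured / fromIntegral expected

main :: IO ()
main = do
  let expected     = expectedBytes 64 2048
      maxResidency = 383698504 -- "bytes maximum residency" from the -sstderr output
  putStrLn $ "expected payload (vecLen = 2048): " ++ show expected ++ " bytes"
  putStrLn $ "expected payload (vecLen = 8192): " ++ show (expectedBytes 64 8192) ++ " bytes"
  putStrLn $ "max residency / expected: " ++ show (overhead maxResidency expected)
```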

Answer 1

bga*_*ari 5

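To attack a problem like this, I usually start by littering the code with SCC pragmas wherever I feel there could be a lot of allocation. In this case, I suspected e and c1 in transform and genUVs in sample,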

...

transform :: (Monad m, RandomGen g)
           => StateT g m [U.Vector Int64]
transform = do
      e <- {-# SCC hi1 #-} liftM (U.map round . uncurry (U.++) . U.unzip) $ U.replicateM (vecLen `div` 2) sample
      c1 <- {-# SCC hi2 #-} U.replicateM (7*vecLen) $ state random
      return [U.concat $ replicate 7 e, c1]

sample :: (RandomGen g, Monad m) => StateT g m (Double, Double)
sample = do 
    let genUVs = {-# SCC genUVs #-} liftM2 (,) (state $ randomR (-1,1)) (state $ randomR (-1,1))
        -- memory usage drops and productivity increases to about 58% if I set the guard to "False" (the real code needs a guard here)
        uvGuard (u,v) = u+v >= 2 -- False -- 
    (u,v) <- iterateWhile uvGuard genUVs
    return $ (u, v)


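We begin by using -hy to see what types of objects are the problem. This reveals a number of different types, including Integer, Int32, StdGen, Int, and (,). Using -hc we can determine that nearly all of these values are being allocated by c1 in transform. This is confirmed by -hr, which tells us who is holding references to these objects (and thus preventing them from being garbage collected). We can further confirm that c1 is the culprit by examining the types of the objects it retains with -hrc1 -hy (assuming we have annotated it with {-# SCC c1 #-}).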
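As a minimal sketch of that workflow on a toy program: the build and run lines in the comments assume a plain ghc invocation with profiling libraries installed, and the cost centre names c1 and total, along with the function sumTo, are hypothetical. Exact flag behaviour may differ between GHC versions.

```haskell
module Main (main) where

-- Build with profiling enabled (assumed invocation):
--   ghc -prof -fprof-auto -rtsopts Main.hs
-- Then run with one heap-profile breakdown at a time:
--   ./Main +RTS -hy         (break down by closure type)
--   ./Main +RTS -hc         (break down by allocating cost centre)
--   ./Main +RTS -hr         (break down by retainer set)
--   ./Main +RTS -hrc1 -hy   (types of the objects retained by cost centre c1)
-- Each run writes Main.hp, which hp2ps can render as a graph.

sumTo :: Int -> Int
sumTo n =
  let xs = {-# SCC c1 #-} [1 .. n] -- hypothetical cost centre named c1
  in {-# SCC total #-} sum xs      -- hypothetical cost centre named total

main :: IO ()
main = print (sumTo 100000)
```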

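The fact that c1 retains so many objects suggests that it is not being evaluated when we would like it to be. While c1 is a fairly short vector once evaluated, before evaluation it requires several thousand random seeds, the associated closures, and likely a number of other objects.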
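The same effect can be sketched without the random and vector packages. In the example below, nextInt is a hypothetical stand-in for state random (a toy linear congruential step, not the real generator), and drawLazy, drawForced, and sameValues are illustrative names. It only shows the shape of the problem: the lazy StateT from Control.Monad.State can hand back a result whose elements are still unevaluated and refer back to earlier states, while deepseq forces the result before it is passed on.

```haskell
module Main (main) where

import Control.DeepSeq (deepseq)
import Control.Monad (replicateM)
import Control.Monad.State (StateT, evalState, evalStateT, state)

-- Hypothetical stand-in for `state random`: a toy linear congruential step.
nextInt :: Monad m => StateT Int m Int
nextInt = state $ \s ->
  let s' = (s * 1103515245 + 12345) `mod` 2147483648
  in (s', s')

-- Lazy version: the returned list may still consist of thunks that
-- refer back to earlier generator states.
drawLazy :: Monad m => Int -> StateT Int m [Int]
drawLazy n = replicateM n nextInt

-- Forced version: fully evaluate the list before handing it on.
drawForced :: Monad m => Int -> StateT Int m [Int]
drawForced n = do
  xs <- replicateM n nextInt
  xs `deepseq` return xs

-- Forcing changes when the work happens, not the values produced.
sameValues :: Int -> Int -> Bool
sameValues n seed = evalState (drawLazy n) seed == evalState (drawForced n) seed

main :: IO ()
main = do
  xs <- evalStateT (drawForced 5) 42
  print xs
  print (sameValues 5 42)
```

Writing `return $!! xs` with `($!!)` from Control.DeepSeq is an equivalent way to spell the forced return.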

Deepseqing c1 reduces GC time from 59% to 23% and reduces memory consumption by an order of magnitude. That is, the final return in transform becomes

deepseq c1 $ return [U.concat $ replicate 7 e, c1]


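After this the profile looks quite reasonable, with the largest space user being ARR_WORDS at roughly 10 MB, allocated by transform (as expected), followed by some tuples, probably from genUVs.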
