Haskell/GHC每线程内存成本

Joh*_*iss 20 memory multithreading haskell ghc

我试图理解Haskell(OS X 10.10.5上的GHC 7.10.1)中的(绿色)线程有多贵.我知道它与真正的操作系统线程相比非常便宜,无论是内存使用还是CPU.

是的,所以我开始编写一个带有forks n(绿色)线程的超简单程序(使用优秀的async库),然后只需每个线程休眠m几秒钟.

嗯,这很容易:

$ cat PerTheadMem.hs 
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]
Run Code Online (Sandbox Code Playgroud)

首先,让我们编译并运行它:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
Run Code Online (Sandbox Code Playgroud)

应该分叉100k线程并在每个线程中等待10个,然后打印一些信息:

$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0       648 colls,     0 par    0.373s   0.415s     0.0006s    0.0223s
Gen  1        11 colls,     0 par    0.298s   0.431s     0.0392s    0.1535s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time   79.062s  ( 92.803s elapsed)
GC      time    0.670s  (  0.846s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.065s  (  0.091s elapsed)
Total   time   79.798s  ( 93.740s elapsed)

%GC     time       0.8%  (0.9% elapsed)

Alloc rate    4,312,344 bytes per MUT second

Productivity  99.2% of total user, 84.4% of total elapsed


real    1m33.757s
user    1m19.799s
sys 0m2.260s
Run Code Online (Sandbox Code Playgroud)

花了很长时间(1m33.757s),因为每个线程应该只等待10秒,但我们现在已经构建非线程非常公平.总而言之,我们使用350 MB,这不是太糟糕,每个线程3.5 KB.鉴于初始堆栈大小(-ki为1 KB).

是的,但现在让我们编译处于线程模式,看看我们是否可以加快速度:

$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0      6435 colls,     0 par    0.860s   1.022s     0.0002s    0.0028s
Gen  1        20 colls,     0 par    2.206s   2.740s     0.1370s    0.3874s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.000s  (  0.001s elapsed)
MUT     time    0.879s  (  8.534s elapsed)
GC      time    3.066s  (  3.762s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.074s  (  0.247s elapsed)
Total   time    4.021s  ( 12.545s elapsed)

Alloc rate    4,544,893,364 bytes per MUT second

Productivity  23.7% of total user, 7.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m12.565s
user    0m4.021s
sys 0m1.154s
Run Code Online (Sandbox Code Playgroud)

哇,快,只是现在12S,更好的方式.从Activity Monitor我看到它大致使用4个OS线程用于100k绿色线程,这是有道理的.

但是,总内存为3657 MB!这比使用的非线程版本多10倍......

到现在为止,我没有使用做任何分析-prof-hy左右.为了进一步研究,我-hy单独的运行中做了一些堆profiling().在任何一种情况下内存使用都没有改变,堆分析图看起来有趣的不同(左:非线程,右:线程)但我找不到10x差异的原因. 堆配置文件差异

差异化分析输出(.prof文件)我也找不到任何真正的区别. 教授差异

因此我的问题是:内存使用的10倍差异来自哪里?

编辑:提一下:当程序甚至没有使用分析支持进行编译时,同样的差异也适用.因此,运行time ./PerTheadMem 100000 10 +RTS -sstderrghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs为3559 MB.而随着ghc -rtsopts -fforce-recomp PerTheadMem.hs它的395 MB.

编辑2:在Linux(GHC 7.10.2on Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64)上同样的情况发生:1m28.538s中的非线程460 MB和3484 MB的线程是12.604s./usr/bin/time -v ...报告Maximum resident set size (kbytes): 413684Maximum resident set size (kbytes): 1645384分别.

编辑3:也改变了程序forkIO直接使用:

import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    mvar <- newEmptyMVar
    mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
          [1..numThreads]
    mapM_ (\_ -> takeMVar mvar) [1..numThreads]
Run Code Online (Sandbox Code Playgroud)

并且它不会改变任何东西:非线程:152 MB,线程:3308 MB.

dsi*_*ign 11

恕我直言,罪魁祸首是threadDelay.*threadDelay**使用了大量内存.这是一个与你的程序相当的程序,它在内存方面表现得更好.它通过长时间运行的计算确保所有线程同时运行.

uBound = 38
lBound = 34

doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 = let
                  a = doSomething (n-1) 
                  b = doSomething (n-2) 
                in a `seq` b `seq` (a + b)
              | otherwise = doSomething (n `mod` uBound )

e :: Chan Integer -> Int -> IO ()
e mvar i = 
    do
        let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound) ) 
        y `seq` writeChan mvar y

main = 
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
            dld = (sleep*1000*1000) 
        chan <- newChan
        mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
        putStrLn "All threads created"
        mapM_ (\_ -> readChan chan >>= putStrLn . show ) [1..numThreads]
        putStrLn "All read"
Run Code Online (Sandbox Code Playgroud)

以下是时间统计:

 $ ghc -rtsopts -O -threaded  test.hs
 $ ./test 200 10 +RTS -sstderr -N4

 133,541,985,480 bytes allocated in the heap
     176,531,576 bytes copied during GC
         356,384 bytes maximum residency (16 sample(s))
          94,256 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
  Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

  Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time   63.747s  ( 16.333s elapsed)
  GC      time    1.189s  (  0.903s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   64.938s  ( 17.239s elapsed)

  Alloc rate    2,094,861,384 bytes per MUT second

  Productivity  98.2% of total user, 369.8% of total elapsed

gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2
Run Code Online (Sandbox Code Playgroud)

每个线程的最大驻留时间约为1.5 kb.我玩了一些线程数和计算的运行长度.由于线程在forkIO之后立即开始执行操作,因此创建100000个线程实际上需要很长时间.但结果持有1000个线程.

这是另一个程序,其中threadDelay已被"排除",这个程序不使用任何CPU,并且可以使用100000个线程轻松执行:

e :: MVar () -> MVar () -> IO ()
e start end = 
    do
        takeMVar start
        putMVar end ()

main = 
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
        starts <- mapM (const newEmptyMVar ) [1..numThreads]
        ends <- mapM (const newEmptyMVar ) [1..numThreads]
        mapM_ (\ (start,end) -> forkIO $ e start end) (zip starts ends)
        mapM_ (\ start -> putMVar start () ) starts
        putStrLn "All threads created"
        threadDelay (sleep * 1000 * 1000)
        mapM_ (\ end -> takeMVar end ) ends
        putStrLn "All done"
Run Code Online (Sandbox Code Playgroud)

结果如下:

     129,270,632 bytes allocated in the heap
     404,154,872 bytes copied during GC
      77,844,160 bytes maximum residency (10 sample(s))
      10,929,688 bytes maximum slop
             165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
  Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

  Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.189s  ( 10.094s elapsed)
  GC      time    0.545s  (  0.217s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time    0.735s  ( 10.313s elapsed)

  Alloc rate    685,509,460 bytes per MUT second

  Productivity  25.9% of total user, 1.8% of total elapsed
Run Code Online (Sandbox Code Playgroud)

在我的i5上,创建100000个线程并放入"start"mvar只需不到一秒钟.每个线程的峰值驻留大约为778字节,一点都不差!


检查threadDelay的实现,我们发现它对于线程和非线程的情况实际上是不同的:

https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay

然后在这里:https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html

看起来很无辜.但是对于那些调用threadDelay的人来说,旧版本的base有一个神秘的(内存)厄运拼写:

https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121

如果还有问题,很难说.但是,人们总是希望"真实生活"并发程序不需要同时在threadDelay上等待太多线程.我从一开始就会关注我对threadDelay的使用.