加速Haskell并发

jos*_*uan 8 memory concurrency haskell lazy-evaluation

MVar,TVar,IORef,...我无法加快一个thunk问题(我想).

(我原来的问题是一个线程代码,我做"forkIO"n次调用"addMany";但我认为我的问题是"shW"函数)

让下一个代码:

{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Monad
import System.Environment(getArgs)
import Data.Int
import Data.IORef

-- "i" times, add "n" for each IORef (in "a")
addMany :: [IORef Int64] -> Int64 -> Int64 -> IO ()
addMany !a !n !i =
  forM_ [1..i] (\_ ->
    forM_ a (shW n))

-- MVar, TVar, IORef, ... read/write (x' = x + k)
shR = readIORef
shW !k !r = atomicModifyIORef r (\ !x' -> (x' + k, ()))

main = do
  (n':i':_) <- getArgs
  let (n, i) = (read n', read i')
  v <- forM [1..n] (\_ -> newIORef 0)
  addMany v 1 i
  mapM shR v >>= (putStrLn.show.sum)
Run Code Online (Sandbox Code Playgroud)

然后,个人资料结果显示:

MUT     time    3.12s  (  3.12s elapsed)
GC      time    6.96s  (  6.96s elapsed)
...

COST CENTRE  MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                     47           0    0.0    0.0   100.0  100.0
 main        Main                     95           0    0.0    0.0   100.0  100.0
  main.i     Main                    100           1    0.0    0.0     0.0    0.0
  addMany    Main                     99           1    0.4    0.5   100.0  100.0
   addMany.\ Main                    101       15000    6.6    0.0    99.6   99.5
    shW      Main                    102     2250000   92.7   99.5    93.0   99.5
     shW.\   Main                    104     2250000    0.3    0.0     0.3    0.0
Run Code Online (Sandbox Code Playgroud)

我无法在"shW"调用中删除thunk(内存使用量很大).怎么了?

类似的C#代码运行得更快(更快):

class Node { 
    private object m; 
    private int n; 

    public Node() {n = 0; m = new object();} 
    public void Inc() {lock(m) n++;} 
    public int Read() {return n;} 
} 

class MainClass { 

    public static void Main(string[] args) { 

        var nitems = 1000; 
        var nthreads = 6; 
        var niters = 100000; 

        var g = Enumerable.Range(0, nitems).Select(i => new Node()).ToArray(); 
        Task.WaitAll(Enumerable.Range(0, nthreads).Select(q => Task.Factory.StartNew(() => { 
            var z = niters; 
            while(z-- > 0) 
                foreach(var i in g) 
                    i.Inc(); 
        })).ToArray()); 

        Console.WriteLine("Sum node values: {0}", g.Sum(i => i.Read())); 

    } 
} 
Run Code Online (Sandbox Code Playgroud)

非常感谢!

UPDATE

解决原始问题:https://gist.github.com/3742897

非常感谢Don Stewart!

Don*_*art 27

当您查看堆和GC时间时,很明显会发生泄漏:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
   1,208,298,840 bytes allocated in the heap
   1,352,861,868 bytes copied during GC
     280,181,684 bytes maximum residency (9 sample(s))
       4,688,680 bytes maximum slop
             545 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1677 colls,     0 par    2.27s    2.22s     0.0013s    0.0063s
  Gen  1         9 colls,     0 par    1.66s    1.77s     0.1969s    1.0273s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.70s  (  0.77s elapsed)
  GC      time    3.92s  (  4.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.62s  (  4.77s elapsed)

  %GC     time      84.8%  (83.8% elapsed)

  Alloc rate    1,718,469,461 bytes per MUT second

  Productivity  15.2% of total user, 14.7% of total elapsed

  real    0m4.752s
  user    0m0.015s
  sys     0m0.046s
Run Code Online (Sandbox Code Playgroud)

280M驻留率和89%GC.许多thunk被分配并扔掉了.

堆配置文件使这看起来很复杂.

在此输入图像描述

线索是这些是"stg_app*"的东西(即STG机器应用thunk).

修改函数族的一个微妙但非常显着的问题是这里的问题 - 当你有一个懒惰的atomicModify时,根本没有办法严格更新字段,而不需要值.

因此,您所有谨慎使用的atomicModifyIORef r (\ !x' -> (x' + k, ()))是构建(+)函数的一系列应用程序,以便观察链的结果(即单元格中的值),每次添加在其参数中都是严格的.不是你想要的!您对modifyIORef 参数的严格注释都不会对单元本身产生任何影响.现在,通常一个懒惰的修改是你想要的 - 它只是一个指针交换,所以你可以有非常短的原子部分.

但有时这不是你想要的.

(关于这个问题的背景,请参阅GHC票#5926,但是,这个问题至少在2007年就已知,当时我编写了严格并发包以避免MVars的这个问题.它在2009讨论过,我们现在已经严格2012年的版本).

首先要求该值,您可以删除该问题.例如

shW !k !r = --atomicModifyIORef r (\x -> (x + k, ()))
    do x <- shR r
       writeIORef r $! (x+1)
Run Code Online (Sandbox Code Playgroud)

请注意,此问题现已记录在库中,您可以使用atomicModifyIORef'它来避免它.

我们得到:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
     120,738,836 bytes allocated in the heap
       3,758,476 bytes copied during GC
          73,020 bytes maximum residency (1 sample(s))
          16,348 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       230 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.17s  (  0.17s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.19s  (  0.17s elapsed)

  %GC     time       0.0%  (3.9% elapsed)

  Alloc rate    643,940,458 bytes per MUT second

  Productivity 100.0% of total user, 108.3% of total elapsed


real    0m0.218s
user    0m0.000s
sys     0m0.015s
Run Code Online (Sandbox Code Playgroud)

也就是说,22倍的加速,内存使用变得不变.对于笑,这是新的堆配置文件:

在此输入图像描述