Haskell: unexpected time-complexity in the computation involving large lists

Ben*_*tic 4 performance haskell list time-complexity ghc

I am dealing with a computation whose intermediate result is a list A = [B], i.e. a list of K lists B, each of length L. The time complexity of computing a single element of B is controlled by a parameter M and is theoretically linear in M, so I would expect the time complexity of computing A to be O(K*L*M). However, that is not what I observe, and I don't understand why.

Here is a simple but complete sketch program that exhibits the problem:

import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)

type Point = (Double, Double)

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b in b * (q - fromIntegral (floor q))

standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))

trajectory :: (Point -> Point) -> Point -> [Point] 
trajectory map initial = initial : (trajectory map $ map initial)

justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ []     = []

subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m
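-- e.g. subTrace 3 2 [0..] == [0,2,4,6]: n+1 points, one every m steps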

ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
                 ps = randoms (mkStdGen 21)
             in take n $ zip qs ps 

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) -> 
                              Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m = 
    parMap rdeepseq ((map observable . subTrace n m) . orbitGen)
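-- parMap rdeepseq creates one spark per initial point and fully forces each trace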

main = let k = 100
           l = 100
           m = 100
           orbitGen = trajectory (standardMap 7)
           observable (p, q) = p^2 - q^2
           initials = ensemble k
           mean xs = (sum xs) / (fromIntegral $ length xs)
           result =   (map mean) 
                    $ transpose 
                    $ ensembleTrace orbitGen observable l m 
                    $ initials
       in mapM_ print result

I am compiling with

$ ghc -O2 stdmap.hs -threaded

and running with

$ ./stdmap +RTS -N4 > /dev/null

on an Intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1, and get the following results for different sets of the parameters K, L, M (k, l, m in the program code):

(K=200,L=200,M=200)   -> real    0m0.774s
                         user    0m2.856s
                         sys     0m0.147s

(K=2000,L=200,M=200)  -> real    0m7.409s
                         user    0m28.102s
                         sys     0m1.080s

(K=200,L=2000,M=200)  -> real    0m7.326s
                         user    0m27.932s
                         sys     0m1.020s

(K=200,L=200,M=2000)  -> real    0m10.581s
                         user    0m38.564s
                         sys     0m3.376s

(K=20000,L=200,M=200) -> real    4m22.156s
                         user    7m30.007s
                         sys     0m40.321s

(K=200,L=20000,M=200) -> real    1m16.222s
                         user    4m45.891s
                         sys     0m15.812s

(K=200,L=200,M=20000) -> real    8m15.060s
                         user    23m10.909s
                         sys     9m24.450s

I don't quite understand where such poor scaling could come from. If I understand correctly, the lists are lazy and should not be constructed in full, since they are consumed head to tail. As the measurements show, the excessive real-time consumption is correlated with excessive system-time consumption, so the excess is on the system account. But if some memory management were wasting time there, it should still scale linearly in K, L, M.

Help!

EDIT

I made changes in the code according to the suggestions given by Daniel Fischer, which indeed solved the bad scaling with respect to M. As pointed out, forcing strict evaluation in trajectory avoids the construction of large thunks. I understand the performance improvement behind that, but I still don't understand the bad scaling of the original code: if I understand correctly, both the space and the time complexity of constructing such a thunk should be linear in M?
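
For reference, a toy sketch of the difference (just an illustration, not the measured code): with lazy iteration the n-th element is an O(n)-deep chain of suspended applications, while in the strictified variant each element is forced as soon as the next cons cell is demanded.

{-# LANGUAGE BangPatterns #-}

-- lazy: the n-th element is a suspended chain f (f (... (f x))) of depth n
lazyIterate :: (a -> a) -> a -> [a]
lazyIterate f x = x : lazyIterate f (f x)

-- strict: forcing the spine (e.g. by dropping elements) also forces each
-- intermediate value to WHNF; for pairs such as Point, the components need
-- their own bangs, as in Daniel Fischer's version below
strictIterate :: (a -> a) -> a -> [a]
strictIterate f !x = x : strictIterate f (f x)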

Furthermore, I still have trouble understanding the bad scaling with respect to K (the size of the ensemble). I made two additional measurements with the improved code at K=8000 and K=16000, keeping L=200, M=200. Scaling is as expected up to K=8000, but for K=16000 it is already anomalous. The problem seems to be the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably shows up as poor parallelism, which I quantify by the quotient Q = (MUT cpu time) / (MUT real time), ideally equal to the number of CPUs. However, Q ~ 4 for K=8000 but only Q ~ 2 for K=16000. Please help me understand the source of this problem and possible solutions.

K = 8000:

$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null

56,905,405,184 bytes allocated in the heap
   503,501,680 bytes copied during GC
    53,781,168 bytes maximum residency (15 sample(s))
     6,289,112 bytes maximum slop
           151 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
Gen  0     27893 colls, 27893 par    7.85s    1.99s     0.0001s    0.0089s
Gen  1        15 colls,    14 par    1.20s    0.30s     0.0202s    0.0558s

Parallel GC work balance: 23.49% (serial 0%, perfect 100%)

TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time   95.90s  ( 24.28s elapsed)
GC      time    9.04s  (  2.29s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time  104.95s  ( 26.58s elapsed)

Alloc rate    593,366,811 bytes per MUT second

Productivity  91.4% of total user, 360.9% of total elapsed

gc_alloc_block_sync: 315819

K = 16000:

$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null

113,809,786,848 bytes allocated in the heap
  1,156,991,152 bytes copied during GC
    114,778,896 bytes maximum residency (18 sample(s))
     11,124,592 bytes maximum slop
            300 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
Gen  0     135521 colls, 135521 par   22.83s    6.59s     0.0000s    0.0190s
Gen  1        18 colls,    17 par    2.72s    0.73s     0.0405s    0.1692s

Parallel GC work balance: 18.05% (serial 0%, perfect 100%)

TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time  221.77s  (139.78s elapsed)
GC      time   25.56s  (  7.32s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time  247.34s  (147.10s elapsed)

Alloc rate    513,176,874 bytes per MUT second

Productivity  89.7% of total user, 150.8% of total elapsed

gc_alloc_block_sync: 814824

Dan*_*her 7

The point about fmod (from MAD's answer) is good, but there is no need to call out to C; we can do better staying in Haskell land (the ticket the linked thread was about has been fixed in the meantime). The trouble in

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b in b * (q - fromIntegral (floor q))

is that type defaulting causes floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) to be called. Now Integer is a comparatively complicated type with comparatively slow operations, and the conversion from Integer to Double is also relatively complicated. The original code (with parameters k = l = 200 and m = 5000) produced the statistics

./nstdmap +RTS -s -N2 > /dev/null
  60,601,075,392 bytes allocated in the heap
  36,832,004,184 bytes copied during GC
       2,435,272 bytes maximum residency (13741 sample(s))
         887,768 bytes maximum slop
               9 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46734 colls, 46734 par   41.66s   20.87s     0.0004s    0.0058s
  Gen  1     13741 colls, 13740 par   23.18s   11.62s     0.0008s    0.0041s

  Parallel GC work balance: 60.58% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   34.99s  ( 17.60s elapsed)
  GC      time   64.85s  ( 32.49s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   99.84s  ( 50.08s elapsed)

  Alloc rate    1,732,048,869 bytes per MUT second

  Productivity  35.0% of total user, 69.9% of total elapsed

on my machine (-N2 because I have only two physical cores). Simply changing the code to use a type annotation, floor q :: Int, is enough. Spelled out, the annotated fmod would look like this (a sketch; only the annotation itself is the essential change):
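
fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            -- the annotation selects the fast
                            -- floor :: Double -> Int at compile time
                            in b * (q - fromIntegral (floor q :: Int))

This change gives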

./nstdmap +RTS -s -N2 > /dev/null
  52,105,495,488 bytes allocated in the heap
  29,957,007,208 bytes copied during GC
       2,440,568 bytes maximum residency (10481 sample(s))
         893,224 bytes maximum slop
               8 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     36979 colls, 36979 par   32.96s   16.51s     0.0004s    0.0066s
  Gen  1     10481 colls, 10480 par   16.65s    8.34s     0.0008s    0.0018s

  Parallel GC work balance: 68.64% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   29.78s  ( 14.94s elapsed)
  GC      time   49.61s  ( 24.85s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   79.40s  ( 39.80s elapsed)

  Alloc rate    1,749,864,775 bytes per MUT second

  Productivity  37.5% of total user, 74.8% of total elapsed

A reduction of about 20% in elapsed time and 13% in MUT time. Not bad. If we look at GHC's implementations of floor, we can see why:

floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n   -> I# (n -# 1#)
        | otherwise             -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0#   ->
          case negateInt# e of
            s | s ># 52#    -> if m < 0 then (-1) else 0
              | otherwise   ->
                case TO64 m of
                  n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e

floor :: Double -> Int just uses the machine conversion, while floor :: Double -> Integer needs the expensive decodeDoubleInteger and branching. But where floor is called here, we know that all Doubles involved are non-negative, so floor is the same as truncate, which maps directly to the machine conversion double2Int#. So let's try truncate instead of floor:
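
Sketched out (again, only the substitution matters):

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            -- truncate == floor for non-negative arguments,
                            -- and maps directly to double2Int#
                            in b * (q - fromIntegral (truncate q :: Int))

That gives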

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   29.29s  ( 14.70s elapsed)
  GC      time   49.17s  ( 24.62s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   78.45s  ( 39.32s elapsed)

A very small further reduction (to be expected, since fmod isn't really the bottleneck). For comparison, calling out to C:
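
What that might look like (a sketch of an FFI binding to the C library's fmod; MAD's original code is not reproduced here, and note that C's fmod keeps the sign of the dividend, so the wrapper's negative branch is still needed):

{-# LANGUAGE ForeignFunctionInterface #-}

-- import the C library's fmod directly
foreign import ccall unsafe "math.h fmod"
    c_fmod :: Double -> Double -> Double

The timings with the C version: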

  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   31.46s  ( 15.78s elapsed)
  GC      time   54.05s  ( 27.06s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   85.52s  ( 42.85s elapsed)

A bit slower (not surprisingly; you can execute a few primops in the time a call to C takes).

But that's not where the big fish swim. The real trouble is that picking only every m-th element of the trajectories leads to large thunks, which cause a lot of allocation and take a long time to evaluate when their turn finally comes. So let's eliminate that leak and make the trajectories strict:

{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point] 
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)

That reduces allocation and GC time drastically, and as a consequence also the MUT time:

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   21.83s  ( 10.95s elapsed)
  GC      time    0.72s  (  0.36s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   22.55s  ( 11.31s elapsed)

with the original fmod, and

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   18.26s  (  9.18s elapsed)
  GC      time    0.58s  (  0.29s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.84s  (  9.47s elapsed)

with floor q :: Int (within measurement precision, the timings with truncate q :: Int are the same; the allocation figures are slightly lower for truncate).


The problem seems to be the number of overflowed SPARKS, which is 0 for K = 8000 and 7802 for K = 16000. This is probably reflected in the bad concurrency

Yes (although, as far as I know, the more correct term here would be parallelism rather than concurrency), but there is a spark pool, and when that is full, any further sparks are not scheduled; they are then evaluated by the parent thread when their turn comes, without any parallelism. In this case it means that after an initial parallel phase, the computation falls back to sequential.

The size of the spark pool is apparently about 8K (2^13).

If you watch the CPU load via top, you will see that it drops from (close to 100%) * (number of cores) to a much lower value after a while (for me, it was ~100% with -N2 and ~130% with -N4).

The cure is to avoid sparking too much and to let each spark do more work. With the quick-and-dirty modification

-- also needs: import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)

I'm back to almost 200% with -N2 for the entire run, and the productivity is good,

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   57.42s  ( 29.02s elapsed)
  GC      time    5.34s  (  2.69s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   62.76s  ( 31.71s elapsed)

  Alloc rate    1,982,155,167 bytes per MUT second

  Productivity  91.5% of total user, 181.1% of total elapsed

and with -N4 it's also fine (even a bit faster on the wall clock, though not by much, since all threads do basically the same thing and I have only 2 physical cores),

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   99.17s  ( 26.31s elapsed)
  GC      time   16.18s  (  4.80s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  115.36s  ( 31.12s elapsed)

  Alloc rate    1,147,619,609 bytes per MUT second

  Productivity  86.0% of total user, 318.7% of total elapsed

since now the spark pool doesn't overflow.

The proper fix would be to make the chunk size a parameter computed from the number of trajectories and the available cores, so that the number of sparks doesn't exceed the pool size.
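
For instance (a sketch; the chunkSizeFor helper and its heuristic are an assumption, not something measured above):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import GHC.Conc (numCapabilities)

-- hypothetical helper: aim for a handful of sparks per capability,
-- which keeps all cores busy yet stays far below the ~8K pool limit
chunkSizeFor :: Int -> Int
chunkSizeFor nTrajectories = max 1 (nTrajectories `quot` (4 * numCapabilities))

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a)
                          -> Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m initials =
    withStrategy (parListChunk (chunkSizeFor (length initials)) rdeepseq)
    . map ((map observable . subTrace n m) . orbitGen)
    $ initials

With K = 16000 and -N4 this would produce on the order of 16 sparks of 1000 trajectories each, instead of 16000 single-trajectory sparks.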