查找对于内存来说太大的列表的大小?

Kev*_*son 2 haskell functional-programming lazy-evaluation

这里有全新的Haskell程序员.刚刚完成"了解你一个Haskell"......我对一个有多大特定属性的集合感兴趣.我有一些小参数值的代码,但我想知道如何处理更大的结构.我知道Haskell可以做"无限的数据结构",但我只是没有看到如何从我所在的位置到达那里并且了解一个Haskell/Google并没有让我了解这一点.

这里的工作代码为我eSet定的"小"的论点rt

import Control.Monad
import System.Environment
import System.Exit

myPred :: [Int] -> Bool
myPred a = myPred' [] a
    where
        myPred' [] []         = False
        myPred' [] [0]        = True
        myPred' _  []         = True
        myPred' acc (0:aTail) = myPred' acc aTail
        myPred' acc (a:aTail)
             | a `elem` acc   = False
             | otherwise      = myPred' (a:acc) aTail

superSet :: Int -> Int -> [[Int]]
superSet r t = replicateM r [0..t]

eSet :: Int -> Int -> [[Int]]
eSet r t = filter myPred $ superSet r t

main :: IO ()
main = do
    args <- getArgs
    case args of
        [rArg, tArg] ->
            print $ length $ eSet (read rArg) (read tArg)
        [rArg, tArg, "set"] ->
            print $          eSet (read rArg) (read tArg)
        _ ->
            die "Usage: eSet r r set <set optional for printing set itself otherwise just print the size
Run Code Online (Sandbox Code Playgroud)

编译/运行时,我得到了

$ ghc eSet.hs -rtsopts
[1 of 1] Compiling Main             ( eSet.hs, eSet.o )
Linking eSet ...
$ # Here's is a tiny eSet to illustrate.  It is the set of lists of r integers from zero to t with no repeated nonzero list entries
$ ./eSet 4 2 set
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]]
$ ./eSet 8 4 +RTS -sstderr
3393
     174,406,136 bytes allocated in the heap
      29,061,152 bytes copied during GC
       4,382,568 bytes maximum residency (7 sample(s))
         148,664 bytes maximum slop
              14 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       328 colls,     0 par    0.047s   0.047s     0.0001s    0.0009s
  Gen  1         7 colls,     0 par    0.055s   0.055s     0.0079s    0.0147s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.298s  (  0.301s elapsed)
  GC      time    0.102s  (  0.102s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    0.406s  (  0.405s elapsed)

  %GC     time      25.1%  (25.2% elapsed)

  Alloc rate    585,308,888 bytes per MUT second

  Productivity  74.8% of total user, 75.0% of total elapsed

$ ./eSet 10 5 +RTS -sstderr
63591
  27,478,010,744 bytes allocated in the heap
   4,638,903,384 bytes copied during GC
     532,163,096 bytes maximum residency (15 sample(s))
      16,500,072 bytes maximum slop
            1556 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     52656 colls,     0 par    6.865s   6.864s     0.0001s    0.0055s
  Gen  1        15 colls,     0 par    8.853s   8.997s     0.5998s    1.8617s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   52.652s  ( 52.796s elapsed)
  GC      time   15.717s  ( 15.861s elapsed)
  EXIT    time    0.193s  (  0.211s elapsed)
  Total   time   68.564s  ( 68.868s elapsed)

  %GC     time      22.9%  (23.0% elapsed)

  Alloc rate    521,883,277 bytes per MUT second

  Productivity  77.1% of total user, 76.7% of total elapsed
Run Code Online (Sandbox Code Playgroud)

我看到我的内存使用率非常高,并且有很多垃圾收集.在运行时,eSet 12 6我得到一个Segmentation故障.

我觉得filter myPred $ superSet r t让我不要懒惰地让子集一次成为一部分因此我可以处理更大(但有限)的集合,但我不知道如何改变另一种方法来做到这一点.我认为这是我问题的根源.

此外,由于这是我的第一个Haskell程序,因此非常感谢关于样式以及如何实现Haskell模拟"pythonic"的观点!

Dan*_*ner 5

我怀疑这里的罪魁祸首是replicateM,它有这个实现:

replicateM cnt0 f =
    loop cnt0
  where
    loop cnt
        | cnt <= 0  = pure []
        | otherwise = liftA2 (:) f (loop (cnt - 1))
Run Code Online (Sandbox Code Playgroud)

问题是liftA2 (:) f (loop (cnt - 1)); 可能loop (cnt - 1)是在(:)部分应用于元素的所有调用之间共享f,因此loop (cnt - 1)必须保存在内存中.不幸的loop (cnt - 1)是相当长的清单......

说服GHC 不要分享一些东西可能有点繁琐.以下重新定义superSet给了我一个很好的平坦内存使用; 当然,在适合内存的示例中,它可能会慢一些.关键的想法是让它看起来像未经训练的眼睛(即GHC),就像递归的monadic动作取决于之前做出的选择,即使它没有.

superSet :: Int -> Int -> [[Int]]
superSet r t = go r 0 where
    go 0 ignored = if ignored == 0 then [[]] else [[]]
    go r ignored = do
        x <- [0..t]
        xs <- go (r-1) (ignored+x)
        return (x:xs)
Run Code Online (Sandbox Code Playgroud)

如果您不介意避免优化,更自然的定义也可以:

superSet 0 t = [[]]
superSet r t = do
    x <- [0..t]
    xs <- superSet (r-1) t
    return (x:xs)
Run Code Online (Sandbox Code Playgroud)

...但是-O2GHC太聪明了,并且注意到它可以共享递归调用.