为什么这不会在恒定的内存中运行?

Jus*_*ond 9 io haskell lazy-evaluation writefile bytestring

我试图将大量数据写入常量内存中的文件.

import qualified Data.ByteString.Lazy as B

{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    createDirectoryIfMissing True "data/grids/"
    B.writeFile (gridFileName num aa)
                (encode (take num grids))
    B.writeFile (shuffledFileName num aa)
                (encode (take num shuffleds))
Run Code Online (Sandbox Code Playgroud)

然而,这消耗了与大小成比例的内存num.我知道这createGrids是一个足够懒惰的函数,因为我通过将它error "not lazy enough"(如Haskell wiki在此建议的那样)附加到它返回的列表的末尾并且没有引发错误来测试它.take是一个懒惰的函数,在中定义Data.List.encode也是一个定义的惰性函数Data.Binary.B.writeFile定义于Data.ByteString.Lazy.

这是完整的代码,因此您可以执行它:

import Control.Arrow (first)
import Data.Binary
import GHC.Float (double2Float)
import System.Random (next)
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomDouble)
import System.Random.Shuffle (shuffle')
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = writeGrids 1000 64

{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    B.writeFile "grids.bin" (encode (take num grids))
    B.writeFile "shuffleds.bin" (encode (take num shuffleds))

{- a random number generator, dimension of grids to make
   returns a pair of lists, the first is a list of grids of dimensions
   aa x aa, the second is a list of the shuffled grids corresponding to the first list -}
createGrids :: PureMT -> Int -> ([[(Float,Float)]],[[(Float,Float)]])
createGrids rng aa = (grids,shuffleds) where
       rs = randomFloats rng
       grids = map (getGridR aa) (chunksOf (2 * aa * aa) rs) 
       shuffleds = shuffler (aa * aa) rng grids

{- length of each grid, a random number generator, a list of grids
   returns a the list with each grid shuffled -}
shuffler :: Int -> PureMT -> [[(Float,Float)]] -> [[(Float,Float)]]
shuffler n rng (xs:xss) = shuffle' xs n rng : shuffler n (snd (next rng))         xss
shuffler _ _ [] = []

{- divides list into chunks of size n -}
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = go 
     where go xs = case splitAt n xs of
              (ys,zs) | null ys -> []
                      | otherwise -> ys : go zs

{- dimension of grid, list of random floats [0,1]
   returns a list of (x,y) points of length n^2 such that all
   points are in the range [0,1] and the points are a randomly 
   perturbed regular grid -}
getGridR :: Int -> [Float] -> [(Float,Float)]
getGridR n rs = pts where
   nn = n * n
   (irs,jrs) = splitAt nn rs
   n' = fromIntegral n
   grid = [ (p,q) | p <- [0..n'-1], q <- [0..n'-1] ]
   pts = zipWith (\(p,q) (ir,jr) -> ((p+ir)/n',(q+jr)/n')) grid (zip irs jrs)

{- an infinite list of random floats in range [0,1] -}
randomFloats :: PureMT -> [Float]
randomFloats rng = let (d,rng') = first double2Float (randomDouble rng)
                   in d : randomFloats rng'
Run Code Online (Sandbox Code Playgroud)

所需的包是:,bytestring,binary,random,mersenne-random-pure64,random-shuffle

And*_*ács 10

内存使用的两个原因:

首先,Data.Binary.encode似乎没有在恒定的空间中运行.以下程序使用910 MB内存:

import Data.Binary
import qualified Data.ByteString.Lazy as B

len = 10000000 :: Int 

main = B.writeFile "grids.bin" $ encode [0..len]
Run Code Online (Sandbox Code Playgroud)

如果我们离开0,len我们得到97 MB的内存使用量.

相反,以下程序使用1 MB:

import qualified Data.ByteString.Lazy.Char8 as B

main = B.writeFile "grids.bin" $ B.pack $ show [0..(1000000::Int)]
Run Code Online (Sandbox Code Playgroud)

其次,在你的程序中shuffleds包含对内容的引用grids,这可以防止垃圾收集grids.因此,当我们打印时grids,我们也会对它进行评估,然后它必须在内存中存放,直到我们完成打印shuffleds.您的程序的以下版本仍然消耗大量内存,但如果我们注释掉两行中的一行,它会使用常量空间B.writeFile.

import qualified Data.ByteString.Lazy.Char8 as B

writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    B.writeFile "grids.bin" (B.pack $ show (take num grids))
    B.writeFile "shuffleds.bin" (B.pack $ show (take num shuffleds))
Run Code Online (Sandbox Code Playgroud)

  • 第一点:`[a]`的`Binary`实例强制列表的主干为`length` (8认同)
  • 二进制编码的字节串的格式是`Word64大端长度|| data`.这解释了jberryman提到的严格脊椎.可以很容易地对块进行序列化并终止布尔值,使得编码形式像`Word16 big endian chunk size || 块|| 假|| 块|| .. || TRUE`.这应该能够以大小O(块大小)运行.要保留以前的格式,您必须重新编写第一个字段,使IO稍微复杂一些. (5认同)

dup*_*ode 7

对于它的价值,这里是一个完整的解决方案,结合了每个人的想法.内存消耗恒定在~6MB(编译-O2).

import Control.Arrow (first)
import Control.Monad.State (state, evalState)
import Data.Binary
import GHC.Float (double2Float)
import System.Random (next)
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomDouble)
import System.Random.Shuffle (shuffle')
import qualified Data.ByteString as B (hPut)
import qualified Pipes.Binary as P (encode)
import qualified Pipes.Prelude as P (zip, mapM, drain)
import Pipes (runEffect, (>->))
import System.IO (withFile, IOMode(AppendMode))

main :: IO ()
main = writeGrids 1000 64

{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids, shuffleds) = createGrids rng aa
        gridFile = "grids.bin"
        shuffledFile = "shuffleds.bin"
        encoder = P.encode . SerList . take num
    writeFile gridFile ""
    writeFile shuffledFile ""
    withFile gridFile AppendMode $ \hGr ->
        withFile shuffledFile AppendMode $ \hSh ->
            runEffect
                $ P.zip (encoder grids) (encoder shuffleds)
                >-> P.mapM (\(ch1, ch2) -> B.hPut hGr ch1 >> B.hPut hSh ch2)
                >-> P.drain -- discards the stream of () results.

{- a random number generator, dimension of grids to make
   returns a pair of lists, the first is a list of grids of dimensions
   aa x aa, the second is a list of the shuffled grids corresponding to the first list -}
createGrids :: PureMT -> Int -> ( [[(Float,Float)]], [[(Float,Float)]] )
createGrids rng aa = unzip gridsAndShuffleds where
       rs = randomFloats rng
       grids =  map (getGridR aa) (chunksOf (2 * aa * aa) rs)
       gridsAndShuffleds = shuffler (aa * aa) rng grids

{- length of each grid, a random number generator, a list of grids
   returns a the list with each grid shuffled -}
shuffler :: Int -> PureMT -> [[(Float,Float)]] -> [( [(Float,Float)], [(Float,Float)] )]
shuffler n rng xss = evalState (traverse oneShuffle xss) rng
    where
    oneShuffle xs = state $ \r -> ((xs, shuffle' xs n r), snd (next r))

newtype SerList a = SerList { runSerList :: [a] }
    deriving (Show)

instance Binary a => Binary (SerList a) where
    put (SerList (x:xs)) = put False >> put x >> put (SerList xs)
    put _                = put True
    get = do
        stop <- get :: Get Bool
        if stop
            then return (SerList [])
            else do
                x          <- get
                SerList xs <- get
                return (SerList (x : xs))

{- divides list into chunks of size n -}
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = go 
     where go xs = case splitAt n xs of
              (ys,zs) | null ys -> []
                      | otherwise -> ys : go zs

{- dimension of grid, list of random floats [0,1]
   returns a list of (x,y) points of length n^2 such that all
   points are in the range [0,1] and the points are a randomly 
   perturbed regular grid -}
getGridR :: Int -> [Float] -> [(Float,Float)]
getGridR n rs = pts where
   nn = n * n
   (irs,jrs) = splitAt nn rs
   n' = fromIntegral n
   grid = [ (p,q) | p <- [0..n'-1], q <- [0..n'-1] ]
   pts = zipWith (\(p,q) (ir,jr) -> ((p+ir)/n',(q+jr)/n')) grid (zip irs jrs)

{- an infinite list of random floats in range [0,1] -}
randomFloats :: PureMT -> [Float]
randomFloats rng = let (d,rng') = first double2Float (randomDouble rng)
                   in d : randomFloats rng'
Run Code Online (Sandbox Code Playgroud)

评论变化:

  • shuffler现在是仿State函数的遍历.它通过输入列表在单个传递中生成一对对列表,其中每个网格与其混洗版本配对.createGrids然后(懒洋洋地)解压缩这个列表.

  • 文件是使用pipes机器编写的,其方式受到这个答案的启发(我最初使用的是这个P.foldM).请注意,hPut我使用的是严格的字节字符串,因为它作用于生成器提供的严格块P.zip(在精神上,它是一对成对提供块的惰性字节串).

  • SerList在那里举行自定义Binary实例Thomas M. DuBuisson暗示.请注意,我没有过多考虑get实例方法中的懒惰和严格性.如果这会给您带来麻烦,这个问题看起来很有用.