How do I parse a large data block into memory in Haskell?

Sav*_*nel 9 performance haskell space-leak

On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that

  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend... I could reverse my index lookups if that were the case). I can always write my later algorithms with one or the other in mind.
  • has very little memory overhead

I'm trying to build an image file parser. The file format is your basic 8-bit color ppm file, though I intend to support 16-bit color files and PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files that I work with:

3-10 photographs, the smallest being 45MB and the largest being 110MB.

Now, I can't understand the optimizations put into the Netpbm code, so I decided to have my own try at it. It's a simple file format...

I have started by deciding that no matter what the file format, I'm going to store the final image uncompressed in this format:

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }
Run Code Online (Sandbox Code Playgroud)

Then I wrote a parser that works on three vectors like so:

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }
Run Code Online (Sandbox Code Playgroud)

And at the end of the parser, I construct the RGB8 like so:

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC
Run Code Online (Sandbox Code Playgroud)

Written like this, the program, loading a single one of these 45MB images, will consume 5GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all !(Vector Word8), then the program remains within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.

On the theory that each call to V.// is actually fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but... I can't even get that to typecheck. The documentation is nonexistent and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know if it will solve the problems, so I'm very reluctant to even try.

The fundamental problem is actually pretty straightforward:

How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.

In principal, I want the final representation to be immutable, but I don't too much care if I have to use a mutable structure to get there.


Just for completeness, the complete code (BSD3-licensed) is on bitbucket in https://bitbucket.org/savannidgerinel/photo-tools . The performance branch contains a strict version of the parser, which can be made unstrict with a quick change in the Progress data structure of Codec.Image.Netpbm.

To run the performance test

ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr
Run Code Online (Sandbox Code Playgroud)

And*_*ács 4

我首先认为,只需读取整个字节串,然后将内容解压缩到未装箱的向量中就足够了。事实上,即使没有神秘的空间泄漏,您发布的解析代码也会相当糟糕:您在输入的每个字节上复制所有三个向量的全部!谈论二次复杂度。

所以我写了以下内容:

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)
Run Code Online (Sandbox Code Playgroud)

然后使用 45 Mb 的随机字节文件对其进行测试。我承认我很惊讶这段代码导致了千兆字节的 RAM 使用。我很好奇空间泄漏到底在哪里。

不过,可变向量效果很好。以下代码使用 133 Mb RAM,Criterion 将其基准测试为 60 毫秒文件读取。我在评论中添加了一些解释。SO 和其他地方也有大量关于 ST monad 和可变向量的材料(我同意图书馆文档对初学者不友好)。

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]
Run Code Online (Sandbox Code Playgroud)