使用Data.Binary对列表进行延迟解码

Mik*_*cki 6 haskell lazy-evaluation

我懒惰地使用这段代码编码列表(取自这个SO问题):

import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where

    put (Stream [])     = putWord8 0
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)
Run Code Online (Sandbox Code Playgroud)

问题是解码实现不是懒惰的:

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x         <- get
                    Stream xs <- get
                    return (Stream (x:xs))
Run Code Online (Sandbox Code Playgroud)

这看起来像我应该是懒惰的,但是如果我们运行这个测试代码:

head $ unstream (decode $ encode $ Stream [1..10000000::Integer] :: Stream Integer)
Run Code Online (Sandbox Code Playgroud)

内存使用量爆炸.出于某种原因,在让我查看第一个元素之前,它想要解码整个列表.

为什么这不是懒惰,我怎么能让它变得懒惰?

Dan*_*her 6

它不是懒惰的,因为Getmonad是一个严格的状态monad(在二进制-0.5.0.2到0.5.1.1 ;它之前是一个懒惰状态monad,并且在二进制-0.6.*它已成为一个延续monad,我没有分析了这种变化的严格性影响):

-- | The parse state
data S = S {-# UNPACK #-} !B.ByteString  -- current chunk
           L.ByteString                  -- the rest of the input
           {-# UNPACK #-} !Int64         -- bytes read

-- | The Get monad is just a State monad carrying around the input ByteString
-- We treat it as a strict state monad. 
newtype Get a = Get { unGet :: S -> (# a, S #) }

-- Definition directly from Control.Monad.State.Strict
instance Monad Get where
    return a  = Get $ \s -> (# a, s #)
    {-# INLINE return #-}

    m >>= k   = Get $ \s -> case unGet m s of
                             (# a, s' #) -> unGet (k a) s'
    {-# INLINE (>>=) #-}
Run Code Online (Sandbox Code Playgroud)

因此最后的递归

get >>= \x ->
get >>= \(Stream xs) ->
return (Stream (x:xs))
Run Code Online (Sandbox Code Playgroud)

Stream在返回之前强制读取整个内容.

我不认为StreamGetmonad中懒惰地解码a是可能的(因此更不用Binary实例).但您可以使用runGetState以下命令编写延迟解码函数:

-- | Run the Get monad applies a 'get'-based parser on the input
-- ByteString. Additional to the result of get it returns the number of
-- consumed bytes and the rest of the input.
runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)
runGetState m str off =
    case unGet m (mkState str off) of
      (# a, ~(S s ss newOff) #) -> (a, s `join` ss, newOff)
Run Code Online (Sandbox Code Playgroud)

首先编写一个Get返回a 的解析器Maybe a,

getMaybe :: Binary a => Get (Maybe a)
getMaybe = do
    t <- getWord8
    case t of
      0 -> return Nothing
      _ -> fmap Just get
Run Code Online (Sandbox Code Playgroud)

然后使用它来创建类型的函数(ByteString,Int64) -> Maybe (a,(ByteString,Int64)):

step :: Binary a => (ByteString,Int64) -> Maybe (a,(ByteString,Int64))
step (xs,offset) = case runGetState getMaybe xs offset of
                     (Just v, ys, newOffset) -> Just (v,(ys,newOffset))
                     _                       -> Nothing
Run Code Online (Sandbox Code Playgroud)

然后你可以使用Data.List.unfoldr懒惰解码列表,

lazyDecodeList :: Binary a => ByteString -> [a]
lazyDecodeList xs = unfoldr step (xs,0)
Run Code Online (Sandbox Code Playgroud)

并将其包装在一个 Stream

lazyDecodeStream :: Binary a => ByteString -> Stream a
lazyDecodeStream = Stream . lazyDecodeList
Run Code Online (Sandbox Code Playgroud)

  • 由于性能原因,它已被更改.当二进制文件被添加到GHC时,我们发现当使用严格的状态monad时它快2到3倍. (4认同)
  • `get`在`binary <0.5`中很懒.我不确定细节,但是iirc,切换到严格实现(甚至使用未装箱的元组)的原因是性能.对于一个懒惰的状态monad,它很容易构建巨大的thunk,人们遇到了这个问题 - 巨大的内存消耗和缓慢的性能 - 不是太少.大多数情况下,严格的状态monad比懒惰状态更好或至少不差,所以总体而言,增益似乎大于损失.使用当前实现编写惰性解码器可能比使用旧的泄漏更少. (2认同)