Poor performance parsing binary files in Haskell

Dav*_*son 15 performance haskell binaryfiles

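I have a set of binary records packed into a file, and I am reading them using Data.ByteString.Lazy and Data.Binary.Get. With my current implementation, an 8 MB file takes 6 seconds to parse.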

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: Int, price :: Int ,  qty :: Int } deriving (Show)

getTrades = do
  empty <- isEmpty
  if empty
    then return []
    else do
      timestamp <- getWord32le          
      price <- getWord32le
      qty <- getWord16le          
      rest <- getTrades
      let trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
      return (trade : rest)

main :: IO()
main = do
  input <- BL.readFile "trades.bin" 
  let trades = runGet getTrades input
  print $ length trades

What can I do to speed this up?

Nat*_*ell 20

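Refactoring it slightly (basically into a left fold) gives much better performance and lowers GC overhead quite a bit when parsing an 8388600-byte file.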

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int 
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

getTrade :: Get Trade
getTrade = do
  timestamp <- getWord32le
  price     <- getWord32le
  qty       <- getWord16le
  return $! Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)

countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
  stepper (!count, !buffer)
    | BL.null buffer = count
    | otherwise      =
        let (trade, rest, _) = runGetState getTrade buffer 0
        in stepper (count+1, rest)

main :: IO()
main = do
  input <- BL.readFile "trades.bin"
  let trades = countTrades input
  print trades

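And the associated runtime statistics. Even though the allocation numbers are close, GC time and maximum heap size differ considerably between the revisions.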

All examples here were built with GHC 7.4.1 and -O2.
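For anyone who wants to reproduce the numbers, here is a rough sketch of how a test file of that size could be generated with Data.Binary.Put. The field values are made up (the original data is not available), and `putTrade` and `recordCount` are names invented for this sketch; only the 4 + 4 + 2 byte little-endian layout comes from the parsers in this thread.

```haskell
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putWord16le, putWord32le, runPut)

-- One record in the layout the parsers expect:
-- 4-byte timestamp, 4-byte price, 2-byte quantity, all little-endian.
-- The values themselves are arbitrary placeholders.
putTrade :: Int -> Put
putTrade i = do
  putWord32le (fromIntegral i)                     -- timestamp
  putWord32le (fromIntegral (1000 + i `mod` 100))  -- price
  putWord16le (fromIntegral (i `mod` 500))         -- qty

-- 838860 records * 10 bytes per record = 8388600 bytes.
recordCount :: Int
recordCount = 838860

main :: IO ()
main = BL.writeFile "trades.bin" (runPut (mapM_ putTrade [1 .. recordCount]))
```

Timings on such a synthetic file will not necessarily match the figures quoted in this answer, since they depend on the machine and the GHC version.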

The original source, run with +RTS -K1G -RTS because of its excessive stack space usage:

     426,003,680 bytes allocated in the heap
     443,141,672 bytes copied during GC
      99,305,920 bytes maximum residency (9 sample(s))
             203 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.62s  (  0.81s elapsed)

  %GC     time      83.3%  (86.4% elapsed)

Daniel's revision:

     357,851,536 bytes allocated in the heap
     220,009,088 bytes copied during GC
      40,846,168 bytes maximum residency (8 sample(s))
              85 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.24s  (  0.28s elapsed)

  %GC     time      69.1%  (71.4% elapsed)

This post:

     290,725,952 bytes allocated in the heap
         109,592 bytes copied during GC
          78,704 bytes maximum residency (10 sample(s))
               2 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.06s  (  0.07s elapsed)

  %GC     time       5.0%  (6.0% elapsed)
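A note for newer versions of the binary package: as far as I know, `runGetState` is marked deprecated there. The same left fold can be sketched with `runGetOrFail`, which also reports a truncated trailing record instead of throwing. This is only a sketch under that assumption, not a measured replacement for the code above; returning `Either String Int` is a choice made for this example.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, runGetOrFail)

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

getTrade :: Get Trade
getTrade = do
  t <- getWord32le
  p <- getWord32le
  q <- getWord16le
  return $! Trade (fromIntegral t) (fromIntegral p) (fromIntegral q)

-- Left fold over the input: decode one record, bump the strict counter,
-- and continue with the unconsumed remainder of the buffer.
-- An incomplete final record produces a Left with the decoder's message.
countTrades :: BL.ByteString -> Either String Int
countTrades = go 0
  where
    go !count buffer
      | BL.null buffer = Right count
      | otherwise =
          case runGetOrFail getTrade buffer of
            Left (_, _, err) ->
              Left (err ++ " (after " ++ show count ++ " complete records)")
            Right (rest, _, _trade) -> go (count + 1) rest

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  either putStrLn print (countTrades input)
```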


Dan*_*her 17

Your code decodes an 8 MB file in under a second here (ghc-7.4.1), though of course I compiled with -O2.

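However, it needs an inordinate amount of stack space. You can reduce the

  • time
  • stack space
  • heap space

needed by adding more strictness in the appropriate places, and by using an accumulator to collect the trades parsed so far.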

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: {-# UNPACK #-} !Int
                   , price :: {-# UNPACK #-} !Int 
                   , qty :: {-# UNPACK #-} !Int
                   } deriving (Show)

getTrades :: Get [Trade]
getTrades = go []
  where
    go !acc = do
      empty <- isEmpty
      if empty
        then return $! reverse acc
        else do
          !timestamp <- getWord32le
          !price <- getWord32le
          !qty <- getWord16le
          let !trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
          go (trade : acc)

main :: IO()
main = do
  input <- BL.readFile "trades.bin"
  let trades = runGet getTrades input
  print $ length trades

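The strictness and unpacking make sure that no work is left undone to come back and bite you later by holding a reference to a part of the ByteString that should already have been forgotten.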

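If you need Trade to have lazy fields, you can still decode via a type with strict fields and map a conversion over the result list, so that you benefit from the stricter decoding.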
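A minimal sketch of that idea, assuming a second record type used only during decoding. `StrictTrade`, `toTrade` and `getStrictTrades` are names invented for this illustration; the decoding loop is the same accumulator loop as above.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, isEmpty, runGet)

-- The lazy-field record the rest of the program wants to work with.
data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int }
  deriving (Show)

-- Hypothetical strict twin, used only while decoding.
data StrictTrade = StrictTrade
  {-# UNPACK #-} !Int
  {-# UNPACK #-} !Int
  {-# UNPACK #-} !Int

-- Copy the already-evaluated fields into the lazy record.
toTrade :: StrictTrade -> Trade
toTrade (StrictTrade t p q) = Trade t p q

-- Same accumulator loop as above, but producing the strict type.
getStrictTrades :: Get [StrictTrade]
getStrictTrades = go []
  where
    go !acc = do
      done <- isEmpty
      if done
        then return $! reverse acc
        else do
          !t <- getWord32le
          !p <- getWord32le
          !q <- getWord16le
          let !trade = StrictTrade (fromIntegral t) (fromIntegral p) (fromIntegral q)
          go (trade : acc)

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  let trades = map toTrade (runGet getStrictTrades input)
  print (length trades)
```

The conversion costs one extra allocation per record, so whether it pays off depends on how the lazy records are used afterwards.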

However, the code still spends a lot of time garbage collecting, so further improvements may still be necessary.

  • Thank you very much for your answer! You have helped a noob level up. (3 upvotes)