How to refactor this Haskell random bytes outputter?

Vi.*_*Vi. 7 random io performance haskell

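I am trying to generate random data quickly in Haskell, but whenever I use an idiomatic approach I get low speed and a large GC overhead.

Here is the short code: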

import qualified System.Random.Mersenne as RM
import qualified Data.ByteString.Lazy as BL
import qualified System.IO as SI
import Data.Word

main = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    rnd <- RM.randoms  r :: IO [Word8]
    BL.hPutStr SI.stdout $ BL.pack rnd

Here is the fast code:

import qualified System.Random.Mersenne as RM
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Put as DBP
import qualified System.IO as SI
import Data.List
import Control.Monad (void, forever)
import Data.Word

main = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    forever $ do
        x0 <- RM.random r :: IO Word32
        x1 <- RM.random r :: IO Word32
        x2 <- RM.random r :: IO Word32
        x3 <- RM.random r :: IO Word32
        x4 <- RM.random r :: IO Word32
        x5 <- RM.random r :: IO Word32
        x6 <- RM.random r :: IO Word32
        x7 <- RM.random r :: IO Word32
        x8 <- RM.random r :: IO Word32
        x9 <- RM.random r :: IO Word32
        xA <- RM.random r :: IO Word32
        xB <- RM.random r :: IO Word32
        xC <- RM.random r :: IO Word32
        xD <- RM.random r :: IO Word32
        xE <- RM.random r :: IO Word32
        xF <- RM.random r :: IO Word32
        c0 <- RM.random r :: IO Word32
        c1 <- RM.random r :: IO Word32
        c2 <- RM.random r :: IO Word32
        c3 <- RM.random r :: IO Word32
        c4 <- RM.random r :: IO Word32
        c5 <- RM.random r :: IO Word32
        c6 <- RM.random r :: IO Word32
        c7 <- RM.random r :: IO Word32
        c8 <- RM.random r :: IO Word32
        c9 <- RM.random r :: IO Word32
        cA <- RM.random r :: IO Word32
        cB <- RM.random r :: IO Word32
        cC <- RM.random r :: IO Word32
        cD <- RM.random r :: IO Word32
        cE <- RM.random r :: IO Word32
        cF <- RM.random r :: IO Word32
        v0 <- RM.random r :: IO Word32
        v1 <- RM.random r :: IO Word32
        v2 <- RM.random r :: IO Word32
        v3 <- RM.random r :: IO Word32
        v4 <- RM.random r :: IO Word32
        v5 <- RM.random r :: IO Word32
        v6 <- RM.random r :: IO Word32
        v7 <- RM.random r :: IO Word32
        v8 <- RM.random r :: IO Word32
        v9 <- RM.random r :: IO Word32
        vA <- RM.random r :: IO Word32
        vB <- RM.random r :: IO Word32
        vC <- RM.random r :: IO Word32
        vD <- RM.random r :: IO Word32
        vE <- RM.random r :: IO Word32
        vF <- RM.random r :: IO Word32
        b0 <- RM.random r :: IO Word32
        b1 <- RM.random r :: IO Word32
        b2 <- RM.random r :: IO Word32
        b3 <- RM.random r :: IO Word32
        b4 <- RM.random r :: IO Word32
        b5 <- RM.random r :: IO Word32
        b6 <- RM.random r :: IO Word32
        b7 <- RM.random r :: IO Word32
        b8 <- RM.random r :: IO Word32
        b9 <- RM.random r :: IO Word32
        bA <- RM.random r :: IO Word32
        bB <- RM.random r :: IO Word32
        bC <- RM.random r :: IO Word32
        bD <- RM.random r :: IO Word32
        bE <- RM.random r :: IO Word32
        bF <- RM.random r :: IO Word32
        BL.hPutStr SI.stdout  $ DBP.runPut $ do
            DBP.putWord32be x0
            DBP.putWord32be x1
            DBP.putWord32be x2
            DBP.putWord32be x3
            DBP.putWord32be x4
            DBP.putWord32be x5
            DBP.putWord32be x6
            DBP.putWord32be x7
            DBP.putWord32be x8
            DBP.putWord32be x9
            DBP.putWord32be xA
            DBP.putWord32be xB
            DBP.putWord32be xC
            DBP.putWord32be xD
            DBP.putWord32be xE
            DBP.putWord32be xF
            DBP.putWord32be c0
            DBP.putWord32be c1
            DBP.putWord32be c2
            DBP.putWord32be c3
            DBP.putWord32be c4
            DBP.putWord32be c5
            DBP.putWord32be c6
            DBP.putWord32be c7
            DBP.putWord32be c8
            DBP.putWord32be c9
            DBP.putWord32be cA
            DBP.putWord32be cB
            DBP.putWord32be cC
            DBP.putWord32be cD
            DBP.putWord32be cE
            DBP.putWord32be cF
            DBP.putWord32be v0
            DBP.putWord32be v1
            DBP.putWord32be v2
            DBP.putWord32be v3
            DBP.putWord32be v4
            DBP.putWord32be v5
            DBP.putWord32be v6
            DBP.putWord32be v7
            DBP.putWord32be v8
            DBP.putWord32be v9
            DBP.putWord32be vA
            DBP.putWord32be vB
            DBP.putWord32be vC
            DBP.putWord32be vD
            DBP.putWord32be vE
            DBP.putWord32be vF
            DBP.putWord32be b0
            DBP.putWord32be b1
            DBP.putWord32be b2
            DBP.putWord32be b3
            DBP.putWord32be b4
            DBP.putWord32be b5
            DBP.putWord32be b6
            DBP.putWord32be b7
            DBP.putWord32be b8
            DBP.putWord32be b9
            DBP.putWord32be bA
            DBP.putWord32be bB
            DBP.putWord32be bC
            DBP.putWord32be bD
            DBP.putWord32be bE
            DBP.putWord32be bF

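On my computer the short code outputs about 6 megabytes of random bytes per second. The fast code outputs about 150 megabytes per second.

If I reduce the number of variables in the fast code from 64 to 16, the speed drops to about 78 megabytes per second.

How can I make this code compact and idiomatic without slowing it down?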

rkh*_*rov 9

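I don't think lazy IO is considered very idiomatic in Haskell. It may work for one-liners, but for large IO-intensive programs Haskellers use iteratees/conduits/pipes/Oleg-knows-what.

First, some reference points from running your original versions on my computer, compiled with GHC 7.6.3 (-O2 --make) on Linux x86-64. The slow lazy ByteString version: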

$ ./rnd +RTS -s | pv | head -c 100M > /dev/null
 100MB 0:00:09 [10,4MB/s] [         <=>                                       ]
   6,843,934,360 bytes allocated in the heap
       2,065,144 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.2% of total user, 97.7% of total elapsed

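It is not blazingly fast, but it has no GC or memory overhead to speak of. I would be curious to know how and where you got 37% GC time with this code.

The fast version with the unrolled loop: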

$ ./rndfast +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:04 [ 110MB/s] [    <=>                                            ]
  69,434,953,224 bytes allocated in the heap
       9,225,128 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  85.0% of total user, 72.7% of total elapsed

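This is much faster, but interestingly we now have a noticeable GC overhead of about 15%.

Finally, my version using conduits and blaze-builder. It generates 512 random Word64s at a time, producing 4 KB chunks of data (512 x 8 bytes) to be consumed downstream. Performance increased steadily as I grew the list "buffer" size from 32 to 512, but the gains were small above 128.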

import Blaze.ByteString.Builder (Builder)
import Blaze.ByteString.Builder.Word
import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Conduit.Blaze (builderToByteString)
import Data.Word
import System.IO (stdout)
import qualified System.Random.Mersenne as RM

randomStream :: RM.MTGen -> Source IO Builder
randomStream gen = forever $ do
    words <- liftIO $ RM.randoms gen
    yield $ fromWord64shost $ take 512 words

main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    randomStream gen $= builderToByteString $$ CB.sinkHandle stdout

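I noticed that, unlike the two programs above, it is slightly (3-4%) faster when compiled with -fllvm, so the output below comes from the binary produced by LLVM 3.3.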

$ ./rndconduit +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:09 [53,2MB/s] [         <=>                                       ]
   8,889,236,736 bytes allocated in the heap
      10,912,024 bytes copied during GC
          36,376 bytes maximum residency (2 sample(s))
          19,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.0% of total user, 91.9% of total elapsed

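So it runs at about half the speed of the manually unrolled version (53 MB/s versus 110 MB/s), but it is almost as short and readable as the lazy IO one, has almost no GC overhead, and has predictable memory behavior. Maybe there is room for improvement here: comments are welcome.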
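The batching idea above (build a fixed-size chunk, then hand it downstream) can also be sketched without conduit, using the Builder that ships with the bytestring package. This is only an illustrative sketch under stated assumptions: xorshift64 is a hypothetical stand-in generator (not the Mersenne Twister from mersenne-random) chosen so the example depends only on base and bytestring, chunk is a made-up helper name, the words are written big-endian rather than in host order, and no throughput was measured for it.

```haskell
import Data.Bits (shiftL, shiftR, xor)
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word64)
import System.IO (stdout)

-- Hypothetical stand-in generator (xorshift64), NOT mersenne-random.
-- The state must be non-zero, otherwise it stays at zero forever.
xorshift64 :: Word64 -> Word64
xorshift64 x0 = x3
  where
    x1 = x0 `xor` (x0 `shiftL` 13)
    x2 = x1 `xor` (x1 `shiftR` 7)
    x3 = x2 `xor` (x2 `shiftL` 17)

-- Build one chunk of n 64-bit words and return the generator's next state.
chunk :: Int -> Word64 -> (BB.Builder, Word64)
chunk n = go n mempty
  where
    go :: Int -> BB.Builder -> Word64 -> (BB.Builder, Word64)
    go k acc s
      | k <= 0 = (acc, s)
      | otherwise =
          let s' = xorshift64 s
          in s' `seq` go (k - 1) (acc <> BB.word64BE s') s'

-- Emits 16 chunks of 512 words (4 KB each) so that the sketch terminates;
-- an endless stream would simply drop the counter.
main :: IO ()
main = loop (16 :: Int) 88172645463325252
  where
    loop 0 _ = return ()
    loop i s = do
      let (b, s') = chunk 512 s
      BL.hPut stdout (BB.toLazyByteString b)
      loop (i - 1) s'
```

Threading the generator state explicitly keeps chunk pure, which makes the chunk size easy to check in isolation; whether this approaches the conduit version's throughput would have to be measured.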

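Update:

Combining a bit of unsafe byte fiddling with conduits, I was able to write a program that generates 300+ MB/s of random data. It seems that a simple type-specialized tail-recursive function does better than both lazy lists and manual unrolling.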

import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Word
import Foreign.Marshal.Array
import Foreign.Ptr
import Foreign.Storable
import System.IO (stdout)
import qualified System.Random.Mersenne as RM


randomChunk :: RM.MTGen -> Int -> IO ByteString
randomChunk gen bufsize = allocaArray bufsize $ \ptr -> do
    loop ptr bufsize
    -- packCStringLen copies the buffer, so the result outlives allocaArray.
    B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
    where
    -- Fill slots bufsize-1 down to 0 (valid indices are 0 .. bufsize-1).
    loop :: Ptr Word64 -> Int -> IO ()
    loop _ 0 = return ()
    loop ptr n = do
        x <- RM.random gen
        pokeElemOff ptr (n - 1) x
        loop ptr (n - 1)


chunkStream :: RM.MTGen -> Source IO ByteString
chunkStream gen = forever $ liftIO (randomChunk gen 512) >>= yield


main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    chunkStream gen $$ CB.sinkHandle stdout

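At such speeds the IO overhead actually becomes noticeable: the program spends more than a quarter of its running time in system calls, and adding head to the pipeline, as in the examples above, slows it down significantly.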

$ ./rndcond +RTS -s | pv > /dev/null
^C27GB 0:00:10 [ 338MB/s] [         <=>                                       ]
   8,708,628,512 bytes allocated in the heap
       1,646,536 bytes copied during GC
          36,168 bytes maximum residency (2 sample(s))
          17,080 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  98.7% of total user, 73.6% of total elapsed
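A variant of the buffer-filling loop can write straight into the ByteString's own buffer instead of filling a temporary array and copying it with packCStringLen. The sketch below is an illustration under assumptions, not a measured improvement: it uses create from Data.ByteString.Internal (an internal module of bytestring whose interface may shift between versions), xorshift64 is a hypothetical stand-in for the Mersenne Twister so that the example needs only base and bytestring, and the generator state lives in an IORef purely for the sake of the example.

```haskell
import Data.Bits (shiftL, shiftR, xor)
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word64)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (pokeElemOff, sizeOf)
import System.IO (stdout)

-- Hypothetical stand-in generator (xorshift64), NOT mersenne-random.
-- The state must be non-zero.
xorshift64 :: Word64 -> Word64
xorshift64 x0 = x3
  where
    x1 = x0 `xor` (x0 `shiftL` 13)
    x2 = x1 `xor` (x1 `shiftR` 7)
    x3 = x2 `xor` (x2 `shiftL` 17)

-- Allocate a ByteString of nWords 64-bit words and fill it in place.
randomChunk :: IORef Word64 -> Int -> IO B.ByteString
randomChunk ref nWords =
    BI.create (nWords * sizeOf (0 :: Word64)) $ \p -> do
      s0 <- readIORef ref
      s1 <- loop (castPtr p) 0 s0
      writeIORef ref s1
  where
    -- Tail-recursive fill over indices 0 .. nWords-1; returns the final state.
    loop :: Ptr Word64 -> Int -> Word64 -> IO Word64
    loop ptr i s
      | i >= nWords = return s
      | otherwise = do
          let s' = xorshift64 s
          pokeElemOff ptr i s'
          loop ptr (i + 1) s'

-- Writes 16 chunks of 4 KB so that the sketch terminates; wrap the body in
-- forever (Control.Monad) for an endless stream.
main :: IO ()
main = do
  ref <- newIORef 88172645463325252
  mapM_ (\_ -> randomChunk ref 512 >>= B.hPut stdout) [1 .. 16 :: Int]
```

Counting the index upward from 0 keeps every write inside the allocated range, and create saves one copy per chunk; whether that is visible next to the system-call overhead mentioned above would need to be measured.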


Dav*_*ani 2

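I can confirm that the short version is slower than the fast one, though not to the same degree. In 10 seconds the short code generated 111M of data, while the long code generated 833M. This was done on Mac OS X Lion, compiling with GHC 7.6.3 and -O3.

While I don't know why the first solution is so slow, the second solution can be simplified by using replicateM and mapM to remove the repetition (the code below spells replicateM 64 as sequence $ replicate 64):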

main3 = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    forever $ do
        vals <- sequence $ replicate 64 (RM.random r)
        BL.hPutStr SI.stdout $ DBP.runPut $ mapM_ DBP.putWord32be vals

But this solution is still slower: it generated 492M of data in 10 seconds. The final solution is to use Template Haskell to generate the code for the unrolled loop:

main4 = do
  r <- RM.newMTGen Nothing :: IO RM.MTGen
  forever $ do
    $(let varCount = 64
          -- | replaces every instance of oldName with newName in the exp
          replaceNames :: (Typeable t, Data t) => String -> Name -> t -> t
          replaceNames oldName replacementName expr = everywhere (mkT changeName) expr where
              changeName name | nameBase name == oldName = replacementName
                              | otherwise       = name
          singleVarExp :: Name -> ExpQ -> ExpQ
          singleVarExp varName next = replaceNames "patternvar" varName <$> [| RM.random r >>= \patternvar -> $(next) |]
          allVarExps :: [Name] -> ExpQ -> ExpQ
          allVarExps (n:ns) next = foldr (\var result -> singleVarExp var result)
                                         (singleVarExp n next) ns

          singleOutputter :: Name -> ExpQ -> ExpQ
          singleOutputter varName next = [| DBP.putWord32be $(varE varName) >> $(next) |]
          allVarOutput :: [Name] -> ExpQ
          allVarOutput (n:ns) = foldr (\var result -> singleOutputter var result)
                                      (singleOutputter n [| return () |]) ns
          printResultExp :: [Name] -> ExpQ
          printResultExp names = [| BL.hPutStr SI.stdout $ DBP.runPut ($(allVarOutput names)) |]

          result = do
            vars <- replicateM varCount $ newName "x"
            allVarExps vars (printResultExp vars)
      in result)

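It runs at about the same speed as the original fast version. It is not very concise (your fast solution is easier to read), but you can now easily change the number of variables and still have the loop unrolled. I tried 512, but apart from making the compile time longer it did not seem to affect performance much.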