Haskell使用Data.ByteString实现unix的"cat"程序

sta*_*led 23 unix performance haskell pipeline cat

我有以下Haskell代码,实现了"cat"unix命令行实用程序的简单版本.在400MB文件上以"时间"测试性能,速度大约慢3倍.(我用来测试它的确切脚本在代码下面).

我的问题是:

  1. 这是性能的有效测试吗?
  2. 如何让这个程序运行得更快?
  3. 如何识别Haskell程序中的性能瓶颈?

关于问题2和3:我使用了GHC -prof,然后使用+ RTS -p运行,但我发现这里的输出有点无法提供信息.

来源(Main.hs)

module Main where

import System.IO
import System.Environment
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle buf = do
  hPut stdout buf
  eof <- hIsEOF handle
  unless eof $ do
    buf <- hGetSome handle bufsize
    go handle buf

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- hGetSome handle bufsize
  hSetBuffering stdin $ BlockBuffering (Just bufsize)
  hSetBuffering stdout $ BlockBuffering (Just bufsize)
  go handle buf
Run Code Online (Sandbox Code Playgroud)

时序脚本(run.sh):

#!/usr/bin/env bash

# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge

# Compile with optimisation
ghc -O2 Main.hs

# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null

echo ""
echo ""

# Run cat
echo "timing 'cat'"
time cat huge > /dev/null
Run Code Online (Sandbox Code Playgroud)

我的结果:

timing Haskell

real    0m0.980s
user    0m0.296s
sys     0m0.684s


timing 'cat'

real    0m0.304s
user    0m0.001s
sys     0m0.302s
Run Code Online (Sandbox Code Playgroud)

使用-prof进行编译并使用+ RTS -p运行时的分析报告如下:

  Sat Dec 13 21:26 2014 Time and Allocation Profiling Report  (Final)

     Main +RTS -p -RTS huge

  total time  =        0.92 secs   (922 ticks @ 1000 us, 1 processor)
  total alloc = 7,258,596,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.Conc.Signal           84           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          81           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    76           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           70           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     69           0    0.0    0.0     0.0    0.0
Run Code Online (Sandbox Code Playgroud)

bmk*_*bmk 15

这只是试图解决第二个问题的部分答案:

我尝试使用GHC.IO.BufferAPI 这样的东西:

module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle bufPtr = do
  read <- hGetBuf handle bufPtr bufsize
  when (read > 0) $ do
    hPutBuf stdout bufPtr read
    go handle bufPtr

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- newByteBuffer bufsize WriteBuffer

  withBuffer buf $ go handle
Run Code Online (Sandbox Code Playgroud)

它似乎更接近'猫'的表现,但仍然肯定更慢......

time ./Cat huge > /dev/null 
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null  
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total
Run Code Online (Sandbox Code Playgroud)

我认为使用缓冲区API时,我们可以明确地避免hGetSome在原始代码中使用like 时分配所有缓冲区字节串,但我只是在这里猜测,并且不知道两个编译代码到底发生了什么......

更新:在我的笔记本电脑上添加原始代码的性能:

time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total
Run Code Online (Sandbox Code Playgroud)

更新2:添加一些基本的分析结果:

原始代码:

Cat2 +RTS -p -RTS huge

    total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
    total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0
Run Code Online (Sandbox Code Playgroud)

缓冲区API代码:

Cat +RTS -p -RTS huge

    total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
    total alloc =   3,487,712 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0   98.9


                                                      individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                     44           0  100.0   98.9   100.0  100.0
 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0
 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1
 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0
Run Code Online (Sandbox Code Playgroud)

特别注意分配成本的巨大差异......

  • 请注意,对于每个Haskell程序,还有大约10 ms的固定开销,因此如果从Haskell时序中减去它,它甚至更接近于"cat". (6认同)

Tho*_*son 14

最初的问题让我认为这是关于在提供的确切代码中找到性能问题.由于评论"我希望找到一个更惯用的/"高级"Haskell解决方案"与这个假设相矛盾,我将给出合理执行惯用的Haskell解决方案.

我希望任何熟悉Haskell的随机程序员解决这个问题的方法都是使用Lazy bytestrings.这允许程序员简单地指定读取输入和放置输出的任务,同时让编译器担心使用缓冲和循环结构.

模块主要在哪里

import System.IO
import System.Environment
import Data.ByteString.Lazy as BS

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- BS.hGetContents handle
  hPut stdout buf
Run Code Online (Sandbox Code Playgroud)

结果比原始问题中的代码更具可读性和更好的性能:

timing 'cat'

real    0m0.075s
user    0m0.000s
sys     0m0.074s
timing strict bytestring with GHC -O2

real    0m0.254s
user    0m0.126s
sys     0m0.127s
timing strict bytestring with GHC -O2 -fllvm

real    0m0.267s
user    0m0.132s
sys     0m0.134s
timing lazy bytestring with GHC -O2

real    0m0.091s
user    0m0.023s
sys     0m0.067s
timing lazy bytestring with GHC -O2 -fllvm

real    0m0.091s
user    0m0.021s
sys     0m0.069s
Run Code Online (Sandbox Code Playgroud)

也就是说,懒惰的字节串解决方案比它慢21%cat.将cat最后一个用于优先缓存行为导致59ms运行时将Haskell解决方案放慢51%.

编辑:Dons建议使用内存映射IO将更准确地模拟猫的行为.我不确定该语句有多准确,但mmap几乎总能带来更好的性能,这种情况当然也不例外:

timing memory mapped lazy bytestring with GHC -O2

real    0m0.008s
user    0m0.004s
sys     0m0.003s
Run Code Online (Sandbox Code Playgroud)

这是由以下产生的:

module Main where

import System.IO (stdout)
import System.Environment
import System.IO.Posix.MMap.Lazy
import Data.ByteString.Lazy (hPut)

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  buf     <- unsafeMMapFile file
  hPut stdout buf
Run Code Online (Sandbox Code Playgroud)

  • mmap-bytestring应该更接近于`cat`的行为,使用操作系统进行延迟分配. (4认同)

Mic*_*ael 5

备注后节日:

我不确定现在的问题是人们已经启动它了一下.我想看看是什么bytestring-mmap,所以我制作了一个管道版本来"纠正"它的lazy bytestring模块. https://github.com/michaelt/pipes-bytestring-mmap因此,我使用sibis测试方法组装了所有这些程序.https://github.com/michaelt/pipes-bytestring-mmap/tree/master/bench中仅有的两个模块似乎只是愚蠢的面包和黄油haskell,它们使用花哨的显式缓冲区管理.

无论如何,这里有一些结果:当我们向右移动时,文件大小增加10*.有趣的是看程序在不同文件大小上的差异程度.不使用的程序mmap开始在420M时将其字符显示为"文件长度的线性".在那时,之后,它们几乎完全相同,这表明较小尺寸的相当不同的行为不能过于严肃.这些mmap文件的行为类似(相互之间)有一些好奇心(我复制了)所有这些都是在os x上.

4200000           42000000          420000000         4200000000

timing 'cat'

real  0m0.006s    real  0m0.013s    real  0m0.919s    real  0m8.154s
user  0m0.002s    user  0m0.002s    user  0m0.005s    user  0m0.028s
sys   0m0.003s    sys   0m0.009s    sys   0m0.223s    sys   0m2.179s


timing lazy bytestring - idiomatic Haskell (following Thomas M. DuBuisson) 

real  0m0.009s    real  0m0.025s    real  0m0.894s    real  0m9.146s
user  0m0.002s    user  0m0.006s    user  0m0.078s    user  0m0.787s
sys   0m0.005s    sys   0m0.016s    sys   0m0.288s    sys   0m3.001s


timing fancy buffering following statusfailed

real  0m0.014s    real  0m0.066s    real  0m0.876s    real  0m8.686s
user  0m0.005s    user  0m0.028s    user  0m0.278s    user  0m2.724s
sys   0m0.007s    sys   0m0.035s    sys   0m0.424s    sys   0m4.232s


timing fancier use of GHC.Buf following bmk

real  0m0.011s    real  0m0.018s    real  0m0.831s    real  0m8.218s
user  0m0.002s    user  0m0.003s    user  0m0.034s    user  0m0.289s
sys   0m0.006s    sys   0m0.013s    sys   0m0.236s    sys   0m2.447s


timing Pipes.ByteString following sibi

real  0m0.012s    real  0m0.020s    real  0m0.845s    real  0m8.241s
user  0m0.003s    user  0m0.004s    user  0m0.020s    user  0m0.175s
sys   0m0.007s    sys   0m0.014s    sys   0m0.239s    sys   0m2.509s
Run Code Online (Sandbox Code Playgroud)

然后用 mmap

timing Lazy.MMap following dons and Thomas M. DuBuisson 

real  0m0.006s    real  0m0.006s    real  0m0.037s    real  0m0.133s
user  0m0.002s    user  0m0.002s    user  0m0.006s    user  0m0.051s
sys   0m0.003s    sys   0m0.003s    sys   0m0.013s    sys   0m0.061

timing Pipes.ByteString.MMap with SafeT machinery

real  0m0.006s    real  0m0.010s    real  0m0.051s    real  0m0.196s
user  0m0.002s    user  0m0.004s    user  0m0.012s    user  0m0.099s
sys   0m0.003s    sys   0m0.005s    sys   0m0.016s    sys   0m0.072s


timing Pipes.ByteString.MMap 'withFile' style

real  0m0.008s    real  0m0.008s    real  0m0.142s    real  0m0.134s
user  0m0.002s    user  0m0.002s    user  0m0.007s    user  0m0.046s
sys   0m0.004s    sys   0m0.004s    sys   0m0.016s    sys   0m0.066s
Run Code Online (Sandbox Code Playgroud)