使用Iteratee库编写"wc -l" - 如何过滤换行符?

Sal*_*Sal 7 haskell filter iterate

我试图使用Haskell Iteratee库提出相当于"wc -l"的东西.下面是"wc"的代码(它只计算单词 - 类似于hackage上的iteratee示例中的代码),并且运行速度非常快:


{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString


length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
length1 = liftI (step 0)
  where
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs))
    step !i stream     = idone i stream
{-# INLINE length1 #-}
main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
  {- Time measured on a linux x86 box: 
  $ time ./test ## above haskell compiled code
  4950996

  real    0m0.013s
  user    0m0.004s
  sys     0m0.007s

  $  time wc -c /usr/share/dict/words
  4950996 /usr/share/dict/words

  real    0m0.003s
  user    0m0.000s
  sys     0m0.002s
  -}
Run Code Online (Sandbox Code Playgroud)

现在,你如何扩展它来计算快速运行的行数?我做了一个使用Prelude.filter的版本只过滤"\n"到长度,但它比linux"wc -l"慢,因为内存太多了,而gc(懒惰评估,我猜).所以,我使用Data.ListLike.filter编写了另一个版本,但它不会编译因为它没有键入check - 这里的帮助将不胜感激:


  {-# LANGUAGE BangPatterns #-}
  import Data.Iteratee as I
  import Data.ListLike as LL
  import Data.Iteratee.IO
  import Data.ByteString
  import Data.Char
  import Data.ByteString.Char8 (pack)

  numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
  numlines = liftI $ step 0
    where
      step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x ->  x == Data.ByteString.Char8.pack "\n")  xs))
      step !i stream = idone i stream
  {-# INLINE numlines #-}

  main = do
    i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
    result <- run i'
    print result
Run Code Online (Sandbox Code Playgroud)

Joh*_*n L 1

已经有很多好的答案了;我在性能方面没有什么可以提供的,但有一些风格方面的要点。

首先,我会这样写:

import Prelude as P
import Data.Iteratee
import qualified Data.Iteratee as I
import qualified Data.Iteratee.IO as I
import qualified Data.ByteString as B
import Data.Char
import System.Environment

-- numLines has a concrete stream type so it's not necessary to provide an
-- annotation later.  It could have a more general type.
numLines :: Monad m => I.Iteratee B.ByteString m Int
numLines = I.foldl' step 0
 where
  --step :: Int -> Word8 -> Int
  step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc

main = do
  f:_   <- getArgs
  words <- run =<< I.enumFile 65536 f numLines
  print words
Run Code Online (Sandbox Code Playgroud)

最大的区别是这个使用了Data.Iteratee.ListLike.foldl'. 请注意,只有单个流元素对步骤函数很重要,而不是流类型。它与您使用的功能完全相同,例如Data.ByteString.Lazy.foldl'

使用foldl'还意味着您不需要使用 手动编写迭代器liftI。除非绝对必要,否则我不鼓励用户这样做。结果通常会更长时间、更难以维持,而且几乎没有任何好处。

最后,我显着增加了缓冲区大小。在我的系统上,这比默认的 4096 稍快enumerator,默认的 4096 又比您选择的 1024 稍快(使用 iteratee)。当然,使用此设置的 YMMV。