如何在写入文件时压缩输出?

Ale*_*aga 6 compression haskell gzip

我有一个计算,与其他东西一起生成一些数据(很多),我想写入一个文件.

现在构建代码的方式是(简化):

writeRecord :: Handle -> Record -> IO ()
writeRecord h r = hPutStrLn h (toByteString r)
Run Code Online (Sandbox Code Playgroud)

然后在更大的计算期间定期调用该函数.它几乎就像一个日志,实际上,同时写入多个文件.

现在我希望压缩输出文件Gzip.在像Java这样的语言中我会做类似的事情:

outStream = new GzipOutputStream(new FileOutputStream(path)) 
Run Code Online (Sandbox Code Playgroud)

然后只会写入包装的输出流.

在Haskell中执行此操作的方式是什么?我觉得写点什么

writeRecord h r = hPut h ((compressed . toByteString) r)
Run Code Online (Sandbox Code Playgroud)

是不正确的,因为单独压缩每个小位是没有效率的(我甚至尝试过它,并且压缩文件的大小比未压缩的大小更大).

我也不认为我可以生成一个懒惰ByteString(甚至是一个块列表),然后用它来编写它,compressed . fromChunks因为这将需要我的"生成器"在内存中构建完整的东西.并且同时生成多个文件的事实使其更加复杂.

那么在Haskell中解决这个问题的方法是什么?写入文件并对它们进行gzip压缩?

Mic*_*ael 5

所有流媒体库都支持压缩.如果我了解特定问题以及您考虑的方式,io-streams可能是最简单的用途.在这里,我在写入trumpclinton输出流之间交替,它们被写为压缩文件.我接着展示pipes迈克尔conduit计划的等价物

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package io-streams
{-# LANGUAGE OverloadedStrings #-}

import qualified System.IO.Streams as IOS
import qualified System.IO as IO
import Data.ByteString (ByteString)

analyzer :: IOS.OutputStream ByteString -> IOS.OutputStream ByteString -> IO ()
analyzer clinton trump = do 
  IOS.write (Just "This is a string\n") clinton
  IOS.write (Just "This is a string\n") trump
  IOS.write (Just "Clinton string\n") clinton
  IOS.write (Just "Trump string\n") trump   
  IOS.write (Just "Another Clinton string\n") clinton
  IOS.write (Just "Another Trump string\n") trump   
  IOS.write Nothing clinton
  IOS.write Nothing trump

main:: IO ()
main = 
  IOS.withFileAsOutput "some-file-clinton.txt.gz" $ \clinton_compressed ->
  IOS.withFileAsOutput "some-file-trump.txt.gz" $ \trump_compressed -> do
     clinton <- IOS.gzip IOS.defaultCompressionLevel clinton_compressed
     trump <- IOS.gzip IOS.defaultCompressionLevel trump_compressed
     analyzer clinton trump
Run Code Online (Sandbox Code Playgroud)

很明显,你可以混合使用各种IOanalyzer写入两个输出流的行为之间-我只是显示在writeS,可以这么说.特别地,如果analyzer被理解为取决于输入流,则writes可以取决于read来自输入流的s.这是一个(略微!)更复杂的程序.如果我运行上面的程序,我明白了

$ stack gzip_so.hs  
$ gunzip some-file-clinton.txt.gz 
$ gunzip some-file-trump.txt.gz 
$ cat some-file-clinton.txt 
This is a string
Clinton string
Another Clinton string
$ cat some-file-trump.txt 
This is a string
Trump string
Another Trump string
Run Code Online (Sandbox Code Playgroud)

使用管道和导管,有各种方法可以实现上述效果,并且部件的分解程度更高.然而,写入单独的文件会更加微妙.在任何情况下,管道都相当于Michael S的管道程序:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc  --package pipes-zlib 
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import System.IO  (IOMode(..), withFile, Handle)
import Pipes  
import qualified Pipes.ByteString as PB
import qualified Pipes.GZip as P

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
    str <- someAction
    hPutStr h str

producerPipe :: MonadIO m => Producer ByteString m ()
producerPipe = do
    str <- liftIO someAction
    yield str

main :: IO ()
main =  withFile "some-file-pipes.txt.gz"  WriteMode $ \h -> 
     runEffect $ P.compress P.defaultCompression producerPipe  >-> PB.toHandle h 
Run Code Online (Sandbox Code Playgroud)

- 编辑

这里的价值是另一种将多个生产者用管道或管道叠加在一个线程上的方法,以增加Michael S danidiaz提到的不同方法并提到:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package pipes-zlib
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import Pipes.GZip
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import System.IO
import Control.Monad (replicateM_)

producer = replicateM_ 50000 $ do
    marie  "This is going to Marie\n"  -- arbitary IO can be interspersed here
    arthur "This is going to Arthur\n" -- with liftIO
    sylvia "This is going to Sylvia\n" 
  where 
    marie = yield; arthur = lift . yield; sylvia = lift . lift . yield

sinkHelper h p = runEffect (compress bestSpeed p >-> Bytes.toHandle h)

main :: IO ()
main =  
   withFile "marie.txt.gz" WriteMode $ \marie ->
   withFile "arthur.txt.gz"  WriteMode $ \arthur -> 
   withFile "sylvia.txt.gz"  WriteMode $ \sylvia ->
      sinkHelper sylvia
      $ sinkHelper arthur
      $ sinkHelper marie
      $ producer
Run Code Online (Sandbox Code Playgroud)

它非常简单快速,并且可以通过显而易见的改动写入管道 - 但发现它自然涉及更高水平的"monad变换器堆栈"观点.从streaming图书馆这样的东西来看,这是编写这样一个程序最自然的方式.


Mic*_*man 4

使用管道执行此操作相当简单,但您需要稍微调整代码。我整理了一个前后代码的示例来演示它。基本思想是:

  • hPutStr h用。。。来代替yield
  • 添加一些liftIO包装纸
  • 不要使用withBinaryFile等,而是使用runConduitRes, gzip, andsinkFile

这是例子:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes)
import Data.Conduit.Binary (sinkFile)
import Data.Conduit.Zlib (gzip)
import System.IO (Handle)

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
    str <- someAction
    hPutStr h str

-- Conduit version
producerConduit :: MonadIO m => ConduitM i ByteString m ()
producerConduit = do
    str <- liftIO someAction
    yield str

main :: IO ()
main = runConduitRes $ producerConduit
                    .| gzip
                    .| sinkFile "some-file.txt.gz"
Run Code Online (Sandbox Code Playgroud)

您可以在管道教程中了解有关管道的更多信息。

你的 Java 想法很有趣,再给我几分钟,我会添加一个看起来更像这样的答案。

编辑

这是一个更接近您的 Java 风格方法的版本。它依赖于一个SinkFunc.hs可作为 Gist 提供的模块:https://gist.github.com/snoyberg/283154123d30ff9e201ea4436a5dd22d

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -Wall -Werror #-}
import Data.ByteString (ByteString)
import Data.Conduit ((.|))
import Data.Conduit.Binary (sinkHandle)
import Data.Conduit.Zlib (gzip)
import System.IO (withBinaryFile, IOMode (WriteMode))
import SinkFunc (withSinkFunc)

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

producerFunc :: (ByteString -> IO ()) -> IO ()
producerFunc write = do
    str <- someAction
    write str

main :: IO ()
main = withBinaryFile "some-file.txt.gz" WriteMode $ \h -> do
    let sink = gzip .| sinkHandle h
    withSinkFunc sink $ \write -> producerFunc write
Run Code Online (Sandbox Code Playgroud)

编辑2另外一个是为了更好地衡量,实际上用于ZipSink将数据流式传输到多个不同的文件。有很多不同的方法可以对此进行切片,但这是一种有效的方法:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Trans.Resource (MonadResource)
import Data.ByteString (ByteString)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes, ZipSink (..))
import Data.Conduit.Binary (sinkFile)
import qualified Data.Conduit.List as CL
import Data.Conduit.Zlib (gzip)

data Output = Foo ByteString | Bar ByteString

fromFoo :: Output -> Maybe ByteString
fromFoo (Foo bs) = Just bs
fromFoo _ = Nothing

fromBar :: Output -> Maybe ByteString
fromBar (Bar bs) = Just bs
fromBar _ = Nothing

producer :: Monad m => ConduitM i Output m ()
producer = do
    yield $ Foo "This is going to Foo"
    yield $ Bar "This is going to Bar"

sinkHelper :: MonadResource m
           => FilePath
           -> (Output -> Maybe ByteString)
           -> ConduitM Output o m ()
sinkHelper fp f
    = CL.mapMaybe f
   .| gzip
   .| sinkFile fp

main :: IO ()
main = runConduitRes
     $ producer
    .| getZipSink
            (ZipSink (sinkHelper "foo.txt.gz" fromFoo) *>
             ZipSink (sinkHelper "bar.txt.gz" fromBar))
Run Code Online (Sandbox Code Playgroud)

  • 另外,一般来说,即使某些东西很“简单”,对于刚接触库的人来说,工作示例通常也更容易理解。 (3认同)
  • 我之前没有看到评论,我只是回答原来的问题。如果这确实是提问者所寻找的,那么它将涉及管道中的“ZipSink”之类的东西,或者我现在正在写的另一种方法作为概念证明。 (2认同)