Running out of memory when counting characters in a large file

Question

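I want to count the occurrences of each character in a large file. I know that counting has to be done strictly in Haskell (which I tried to achieve with foldl'), but I still run out of memory. For comparison: the file is about 2 GB, while the machine has 100 GB of memory. The file does not contain many distinct characters, maybe 20. What am I doing wrong?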

import Data.List (foldl')
import System.Environment (getArgs)

ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main = do
    [file] <- getArgs
    txt <- readFile file
    print $ foldl' ins [] txt

Answer 1

beh*_*uri 7

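Your ins function creates a large number of thunks, which causes a big memory leak. foldl' only evaluates the accumulator to weak head normal form, which is not enough here: only the first cons cell of the list is forced, while the i+1 counters and the recursive ins cs d tails stay unevaluated. What you need is deepseq from Control.DeepSeq to get to normal form.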
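
As a rough sketch of the deepseq route, the accumulator can be fully evaluated at every step with force from Control.DeepSeq. The helper name countChars and the sample string are only for illustration and are not part of the question; ins is the asker's function, unchanged.

```haskell
import Control.DeepSeq (force)
import Data.List (foldl')

-- The asker's insertion function, unchanged.
ins :: [(Char, Int)] -> Char -> [(Char, Int)]
ins [] c = [(c, 1)]
ins ((c, i) : cs) d
  | c == d    = (c, i + 1) : cs
  | otherwise = (c, i) : ins cs d

-- countChars is a hypothetical helper name.
-- foldl' forces each new accumulator to WHNF; because the accumulator
-- is wrapped in force, reaching WHNF evaluates the whole list,
-- including every counter, so no chain of thunks can build up.
countChars :: String -> [(Char, Int)]
countChars = foldl' (\acc c -> force (ins acc c)) []

main :: IO ()
main = print (countChars "abracadabra")
```

This prints [('a',5),('b',2),('r',2),('c',1),('d',1)]. Note that force walks the whole list on every character, which is acceptable with roughly 20 distinct characters but would get expensive with a large alphabet.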

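Alternatively, use Data.Map.Strict for the counts instead of an association list. Also, if your input is on the order of 2 GB, you are better off using a lazy ByteString rather than a plain String.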

The code below should run in constant memory regardless of the input size:

import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
  go = flip $ alter inc
  inc :: Maybe Int -> Maybe Int
  inc Nothing  = Just 1
  inc (Just i) = Just $ i + 1
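
For trying the counting logic without a file, here is a small variation, assuming the input is already an in-memory String (so it is not meant for the 2 GB case). It uses insertWith instead of alter; the name countChars and the sample string are made up for the example.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- countChars is a hypothetical helper name.
-- insertWith from Data.Map.Strict evaluates the combined value to WHNF
-- before storing it, so each Int counter is a plain number, not a thunk.
countChars :: String -> M.Map Char Int
countChars = foldl' step M.empty
  where
    step m c = M.insertWith (+) 1 c m

main :: IO ()
main = print (M.toList (countChars "abracadabra"))
```

This prints [('a',5),('b',2),('c',1),('d',1),('r',2)], with the keys in sorted order because the result comes from a Map.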
