So I have an 8 MB file, each line of which has 6 ints separated by a space.
My current method for parsing this is:
    tuplify6 :: [a] -> (a, a, a, a, a, a)
    tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

    toInts :: String -> (Int, Int, Int, Int, Int, Int)
    toInts line =
        tuplify6 $ map read stringNumbers
      where stringNumbers = split " " line
and mapping toInts over

    liftM lines . readFile
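This returns a list of tuples. However, when I run it, loading the file and parsing it takes nearly 25 seconds. Is there any way I can speed this up? The file is just plain text.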
You can speed it up by using ByteStrings, e.g.
    module Main (main) where

    import System.Environment (getArgs)
    import qualified Data.ByteString.Lazy.Char8 as C
    import Data.Char (isDigit)

    main :: IO ()
    main = do
        args <- getArgs
        mapM_ doFile args

    doFile :: FilePath -> IO ()
    doFile file = do
        bs <- C.readFile file
        let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
        print (length tups)

    -- Read Ints one at a time; after six, emit a tuple and start over.
    buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
    -- acc is built in reverse, so restore the file order before tupling
    buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs
    buildTups k acc bs
        | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
        | otherwise = case C.readInt bs of
            Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
            Nothing -> error ("No Int found: " ++ show (C.take 100 bs))

    tuplify6 :: [a] -> (a, a, a, a, a, a)
    tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)
runs pretty fast:

    $ time ./fileParse IntList
    200000

    real    0m0.119s
    user    0m0.115s
    sys     0m0.003s
for an 8.1 MiB file.
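If every line really holds exactly six integers, the same ByteString idea can also be written line by line. The following is only a sketch under that assumption: it swaps in the strict Data.ByteString.Char8 module instead of the lazy one used above, and the names Six, parseLine, readWhole and parseAll are made up for illustration. Lines that do not contain exactly six integers are silently skipped here, which may or may not be what you want.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Maybe (mapMaybe)

-- Hypothetical alias for the six-field record.
type Six = (Int, Int, Int, Int, Int, Int)

-- Parse one line of whitespace-separated integers.
-- Returns Nothing unless the line holds exactly six integers.
parseLine :: B.ByteString -> Maybe Six
parseLine line =
    case mapM readWhole (B.words line) of
        Just [l, m, n, o, p, q] -> Just (l, m, n, o, p, q)
        _                       -> Nothing
  where
    -- Accept a token only if readInt consumes all of it.
    readWhole tok = case B.readInt tok of
        Just (i, rest) | B.null rest -> Just i
        _                            -> Nothing

-- Parse a whole file's contents, dropping malformed lines.
parseAll :: B.ByteString -> [Six]
parseAll = mapMaybe parseLine . B.lines
```

It would be used as `parseAll <$> B.readFile path`. Whether it matches the timing above has not been measured.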
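On the other hand, using Strings and your conversion (with a couple of seqs to force evaluation) also took only 0.66 s, so the bulk of the time seems to be spent not on parsing but on working with the result.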
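Oops, I missed a seq, so the reads were never actually evaluated in the String version. With that fixed, String + read takes about 4 seconds. The custom Int parser from @Rotsor's comment is

    foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0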
so parsing apparently does take a significant amount of the time.
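For reference, here is roughly how that fold could be dropped into the String version, together with the kind of seq needed so that the parse actually runs. This is a sketch, not the code that was timed: parseNat, toIntsFast, forceAll and Six are illustrative names, and the digit fold assumes every token is a plain run of ASCII digits with no sign.

```haskell
import Data.List (foldl')

-- Hypothetical alias for the six-field record.
type Six = (Int, Int, Int, Int, Int, Int)

-- Hand-written digit parser built on the fold quoted above.
-- Assumes the token consists only of ASCII digits (no sign, no validation).
parseNat :: String -> Int
parseNat = foldl' (\a c -> 10 * a + fromEnum c - fromEnum '0') 0

-- Parse one line; the seq chain forces all six fields before the
-- tuple is built, so no unevaluated parse is left behind.
toIntsFast :: String -> Six
toIntsFast line = case map parseNat (words line) of
    [l, m, n, o, p, q] ->
        l `seq` m `seq` n `seq` o `seq` p `seq` q `seq` (l, m, n, o, p, q)
    _ -> error ("Bad line: " ++ line)

-- Walk the list and force every tuple to weak head normal form,
-- which (given toIntsFast) forces every field as well.
forceAll :: [Six] -> ()
forceAll = foldl' (\() t -> t `seq` ()) ()
```

Without something like forceAll, a benchmark that only prints the length of the list never evaluates the tuples, which is the mistake described above.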