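I'm new to Haskell and I'm stuck on an efficiency problem.
The task: build a CSV file from a 4 GB text file whose columns have fixed widths.
The column widths are known in advance, e.g. col1 is 4 characters wide, col2 is 2 characters wide, and so on.
The file can only contain ASCII characters in [A-Z0-9], so there is no need to escape cells.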
I have:
$ cat example.txt
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 million lines)
I need:
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...
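To make the required transformation concrete, here is a minimal, unoptimized sketch of it. It is only a reference for what the output should look like, not the fast version I'm after. The names `exampleWidths`, `splitCells` and `toCsvLine` and the widths `[4, 2, 4]` are made up for this illustration; the real file has more columns.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical column widths, for illustration only.
exampleWidths :: [Int]
exampleWidths = [4, 2, 4]

-- Cut one fixed-width line into cells, one cell per width.
splitCells :: [Int] -> BC.ByteString -> [BC.ByteString]
splitCells []     _ = []
splitCells (w:ws) s =
  let (cell, rest) = BC.splitAt w s
  in  cell : splitCells ws rest

-- Join the cells with commas; no escaping, since the input is [A-Z0-9] only.
toCsvLine :: [Int] -> BC.ByteString -> BC.ByteString
toCsvLine ws = BC.intercalate (BC.pack ",") . splitCells ws

main :: IO ()
main = mapM_ (BC.putStrLn . toCsvLine exampleWidths)
             (map BC.pack ["AAAABBCCCC", "AAA1B1CCC1"])
-- prints:
-- AAAA,BB,CCCC
-- AAA1,B1,CCC1
```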
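This is the fastest code I have managed to write in Haskell; it takes about 2 minutes to process the whole 4 GB file.
I need it to take at most 30 seconds.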
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List
col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, …
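For reference, the `col_sizes` definition evaluates to a flat list with a 0 for every data character and a 1 at each boundary between columns. The small sketch below shows this for made-up widths `[4, 2, 4]`; the name `colMask` is hypothetical, and the element type `Int` is an assumption made only so the example prints.

```haskell
import Data.List (intercalate)

-- Same expression as col_sizes, parameterised over the widths.
-- Each width n becomes n zeros; a single 1 separates adjacent columns.
colMask :: [Int] -> [Int]
colMask cs = intercalate [1] (map (`replicate` 0) cs)

main :: IO ()
main = print (colMask [4, 2, 4])
-- prints [0,0,0,0,1,0,0,1,0,0,0,0]
```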