sta*_*tti 10 performance haskell functional-programming immutability
作为一个小练习,我在haskell中计算了以下单词计数程序.它计算文本文件中的不同单词,并输出50个最常用的单词及其频率.
import qualified Data.Map as Map
import Data.List.Split
import Data.List
import Data.Ord
-- Count words
count = Map.toList . foldl' increment Map.empty
where
increment dict k = Map.insert k (1 + Map.findWithDefault 0 k dict) dict
-- Sort the counts
countAndSort = sortBy (flip $ comparing snd) . count
-- Pretty printing
pp :: Show a => [(String,a)] -> IO()
pp = putStrLn . foldl' format "" where
format text (x,y) = text ++ "\n" ++ x ++ "\t" ++ show y
main = readFile "pg13951.txt" >>= pp . take 50 .countAndSort . splitOn " "
Run Code Online (Sandbox Code Playgroud)
问题是,它比使用可变dict的python实现慢16倍:
def increment(dic,word):
dic[word] = dic.get(word,0) + 1
return dic
print sorted(reduce(increment,open("pg13951.txt").read().split(),{}).items(),key=lambda e:-e[1])[:50]
Run Code Online (Sandbox Code Playgroud)
我认为这个问题是由于ghc一次又一次地重复使用同一个地图时,它会不断地重新定位新地图.运行时统计显示了很多分配:
$ ghc -rtsopts count.hs
$ ./count +RTS -sstderr
de 7682
et 4423
la 4238
<snip>
d'Artagnan 511
M. 502
c'est 443
d'Artagnan, 443
705,888,048 bytes allocated in the heap
655,511,720 bytes copied during GC
139,823,800 bytes maximum residency (10 sample(s))
1,049,416 bytes maximum slop
287 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1366 colls, 0 par 2.16s 2.26s 0.0017s 0.0072s
Gen 1 10 colls, 0 par 2.86s 3.09s 0.3093s 1.5055s
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.18s ( 3.36s elapsed)
GC time 5.02s ( 5.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.20s ( 8.72s elapsed)
%GC time 61.2% (61.4% elapsed)
Alloc rate 221,831,366 bytes per MUT second
Productivity 38.8% of total user, 36.5% of total elapsed
Run Code Online (Sandbox Code Playgroud)
我的问题是:有没有办法让这个程序表现得更好,而不需要使用脏的技巧,比如在IO monad中工作,使用可变数据结构等等?
PS:数据文件可从以下URL获得:http://www.gutenberg.org/cache/epub/13951/pg13951.txt
sha*_*ang 18
以下是我尝试过的一些快速简单的优化.
我的机器上的原始版本:
real 0m1.539s
user 0m1.452s
sys 0m0.076s
Run Code Online (Sandbox Code Playgroud)
而不是使用insert
,foldl'
你可以fromListWith
用来计算单词.
count = Map.toList . Map.fromListWith (+) . flip zip (repeat 1)
Run Code Online (Sandbox Code Playgroud)
这是两倍快.
real 0m0.687s
user 0m0.648s
sys 0m0.032s
Run Code Online (Sandbox Code Playgroud)该String
类型是一个链接的字符列表,这使得操作字符串相当优雅但效率低下.我们可以使用该Text
类型来获得更有效的字符串处理.我还重写了你的pp
函数来unlines
代替foldl'
和使用words
而不是splitOn
原始的拆分.
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
pp :: Show a => [(Text,a)] -> IO()
pp = T.putStrLn . T.unlines . map format where
format (x,y) = x <> "\t" <> (T.pack $ show y)
main = T.readFile "pg13951.txt" >>= pp . take 50 .countAndSort . T.words
Run Code Online (Sandbox Code Playgroud)
再次,速度是上一步的两倍.
real 0m0.330s
user 0m0.316s
sys 0m0.008s
Run Code Online (Sandbox Code Playgroud)使用严格版本 Map
import qualified Data.Map.Strict as Map
Run Code Online (Sandbox Code Playgroud)
速度增加约20%
real 0m0.265s
user 0m0.252s
sys 0m0.008s
Run Code Online (Sandbox Code Playgroud)