Optimizing Haskell code

Mas*_*sse 16 optimization performance haskell

I'm trying to learn Haskell, and after a reddit article about Markov text chains I decided to implement Markov text generation, first in Python and now in Haskell. However, I noticed that my Python implementation is faster than the Haskell version, even with Haskell compiled to native code. I'd like to know what I should do to make the Haskell code run faster. For now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure.

I'll post both the Python and the Haskell code. With the same data, Python takes around 3 seconds and Haskell closer to 16 seconds.

It goes without saying that I'll take any constructive criticism :).

import random
import re
import cPickle
class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] != ".":  # '!=', not 'is not': identity comparison on strings is unreliable
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])

-

module Markov
( train
, fox
) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:z:xs) =
     let l = train (y:z:xs)
     in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) l
train _ = M.empty

main = do
  contents <- B.readFile "76.txt"
  print $ train $ B.words contents

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
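For reference, here is a self-contained sketch of the Database this train is meant to build, applied to the fox sentence. It uses Data.Map.Strict, whose insertWith plays the role of the older insertWith' (recent containers releases have dropped the primed variant); the catch-all base case is an assumption added to keep the snippet total.

```haskell
-- Sketch: the question's Database built over the fox sentence,
-- with Data.Map.Strict.insertWith standing in for insertWith'.
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B

type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:z:xs) =
    -- For key (x, y): start with {z: 1}, or bump z's count in the existing inner map.
    M.insertWith (\_new old -> M.insertWith (+) z 1 old)
                 (x, y) (M.singleton z 1) (train (y:z:xs))
train _ = M.empty

fox :: B.ByteString
fox = B.pack "The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

main :: IO ()
main = print (M.lookup (B.pack "the", B.pack "brown") (train (B.words fox)))
```

The pair ("the", "brown") occurs twice in the fox sentence, both times followed by "fox", so the lookup returns an inner map counting "fox" twice.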

Don*_*art 11

a) How are you compiling it? (ghc -O2?)

b) Which version of GHC?

c) Data.Map is quite efficient, but you can be tricked into lazy updates. Use insertWith', not insertWithKey.

d) Don't convert the bytestrings to String. Keep them as bytestrings and store those in the Map.
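Point (c) in a minimal, hedged sketch (hypothetical names; in modern containers, insertWith' survives as Data.Map.Strict.insertWith): a strict left fold plus strict insertion, so no chain of (+) thunks accumulates behind each key.

```haskell
-- Strict word count: foldl' drives the fold, Data.Map.Strict.insertWith
-- evaluates each updated count to WHNF before storing it.
import Data.List (foldl')
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B

wordFreq :: [B.ByteString] -> M.Map B.ByteString Int
wordFreq = foldl' bump M.empty
  where bump m w = M.insertWith (+) w 1 m

main :: IO ()
main = print (wordFreq (B.words (B.pack "the quick the fox the")))
```

With the lazy Data.Map.insertWith instead, each counter would be a suspended tower of (+) applications until first forced, which is exactly the space leak point (c) warns about.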


Nor*_*sey 9

Data.Map is designed under the assumption that comparisons in class Ord take constant time. For string keys this may not be the case, and when the strings are equal it is definitely not the case. You may or may not hit this problem, depending on how large your corpus is and how many words share common prefixes.

I'd be tempted to try a data structure designed for operations on sequence keys, such as the bytestring-trie package kindly suggested by Don Stewart.

  • A bytestring trie, perhaps? http://hackage.haskell.org/package/bytestring-trie (3 upvotes)
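To make the suggestion concrete, here is a toy trie over ByteString keys. This is an illustration only, not the bytestring-trie API: insert and lookup walk the key byte by byte, so their cost is bounded by the key's length, rather than paying an O(key length) Ord comparison at every node of a balanced tree.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B

-- One node: an optional value for the key ending here, plus a child per next byte.
data Trie a = Trie (Maybe a) (M.Map Char (Trie a))

emptyT :: Trie a
emptyT = Trie Nothing M.empty

-- Insert a key, combining with f when the key is already present.
insertT :: (a -> a -> a) -> B.ByteString -> a -> Trie a -> Trie a
insertT f k v (Trie mv kids) = case B.uncons k of
  Nothing        -> Trie (Just (maybe v (f v) mv)) kids
  Just (c, rest) ->
    let child = M.findWithDefault emptyT c kids
    in  Trie mv (M.insert c (insertT f rest v child) kids)

-- Walk the key one byte at a time; no whole-key comparisons anywhere.
lookupT :: B.ByteString -> Trie a -> Maybe a
lookupT k (Trie mv kids) = case B.uncons k of
  Nothing        -> mv
  Just (c, rest) -> M.lookup c kids >>= lookupT rest

main :: IO ()
main = do
  let t = foldr (\w -> insertT (+) w 1) emptyT (B.words (B.pack "fox the fox"))
  print (lookupT (B.pack "fox") t)
```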

Ant*_*ony 7

I've tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second avoids it.

import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
    where go (x:y:z:xs) m = let addWord Nothing   = Just $ M.singleton z 1
                                addWord (Just m') = Just $ M.alter inc z m'
                                inc Nothing    = Just 1
                                inc (Just cnt) = Just $ cnt + 1
                            in go (y:z:xs) $ M.alter addWord (x,y) m
          go _ m = m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.alter (addWord z) (x,y) m
          addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
          inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"
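Both train2 and train3 hinge on M.alter, whose update function receives Nothing when the key is absent and Just the old value otherwise. A minimal sketch of that idiom in isolation (bump is a hypothetical name):

```haskell
import qualified Data.Map.Strict as M

-- Increment a counter, creating it at 1 on first sight:
-- the Maybe-in/Maybe-out shape is exactly what M.alter expects.
bump :: Ord k => k -> M.Map k Int -> M.Map k Int
bump = M.alter (Just . maybe 1 (+1))

main :: IO ()
main = print (foldr bump M.empty "abracadabra")
```

Returning Nothing from the update function would instead delete the key, which is why both addWord and inc always wrap their result in Just.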

I think both are faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.

Edit: In light of Travis Brown's very valid point,

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1
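The subtlety train4 leans on is the argument order of insertWith: when the key is already present, the combining function receives the new value first and the existing value second. That is why inc can ignore the new singleton and simply update the old inner map. A small sketch (illustrative values only):

```haskell
import qualified Data.Map.Strict as M

main :: IO ()
main = do
  -- Key present: the combiner runs as f new old, so "N" lands in front of "O".
  print (M.insertWith (\new old -> new ++ old) 'k' "N" (M.fromList [('k', "O")]))
  -- Key absent: the new value is inserted as-is and the combiner never runs.
  print (M.insertWith (\new old -> new ++ old) 'x' "N" (M.fromList [('k', "O")]))
```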