LFSR实现中的高效比特

Question

LFSR实现中的高效比特

mcm*_*yer 6 haskell bit-manipulation bit-fields lfsr

虽然我有一个很好的LSFR C实现,我想我会在Haskell中尝试相同 - 只是为了看看它是怎么回事.到目前为止,我想出的是比C实现慢两个数量级,这引出了一个问题:性能如何得到改善？显而易见,这个小小的操作是瓶颈,而分析器确认了这一点.

这是使用列表的基线Haskell代码,并且Data.Bits:

import           Control.Monad      (when)
import           Data.Bits          (Bits, shift, testBit, xor, (.&.), (.|.))
import           System.Environment (getArgs)
import           System.Exit        (exitFailure, exitSuccess)

tap :: [[Int]]
tap = [
    [],            [],            [],            [3, 2],
    [4, 3],        [5, 3],        [6, 5],        [7, 6],
    [8, 6, 5, 4],  [9, 5],        [10, 7],       [11, 9],
    [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
    [16,15,13,4],  [17, 14],      [18, 11],      [19, 6, 2, 1],
    [20, 17],      [21, 19],      [22, 21],      [23, 18],
    [24,23,22,17], [25, 22],      [26, 6, 2, 1], [27, 5, 2, 1],
    [28, 25],      [29, 27],      [30, 6, 4, 1], [31, 28],
    [32,22,2,1],   [33,20],       [34,27,2,1],   [35,33],
    [36,25],       [37,5,4,3,2,1],[38,6,5,1],    [39,35],
    [40,38,21,19], [41,38],       [42,41,20,19], [43,42,38,37],
    [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
    [48,47,21,20], [49,40],       [50,49,24,23], [51,50,36,35],
    [52,49],       [53,52,38,37], [54,53,18,17], [55,31],
    [56,55,35,34], [57,50],       [58,39],       [59,58,38,37],
    [60,59],       [61,60,46,45], [62,61,6,5],   [63,62]        ]

xor' :: [Bool] -> Bool
xor' = foldr xor False

mask ::  (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
    where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
        tap' = map (subtract 1) tap

main :: IO ()
main = do
    args <- getArgs
    when (null args) $ fail "Usage: lsfr <number-of-bits>"
    let len = read $ head args
    when (len < 8) $ fail "No need for LFSR"
    let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
    if out == 0 then do
        putStr "OK\n"
        exitSuccess
    else do
        putStr "FAIL\n"
        exitFailure

Run Code Online (Sandbox Code Playgroud)

基本上,它测试对于任何给定的位长度定义的LSFR是否tap :: [[Int]]具有最大长度.(更确切地说,它只检查LSFR在2 ⁿ次迭代后是否达到初始状态(零).)

根据剖析器,最昂贵的线是反馈位d0 = xor' $ map (testBit lfsr) tap'.

到目前为止我尝试过的:

使用Data.Array:尝试放弃,因为没有foldl/r
使用Data.Vector:比基线快一点

我使用的编译器选项是:-O2,LTS Haskell 8.12 (GHC-8.0.2).

参考C++程序可以在gist.github.com上找到.

不能期望Haskell代码(？)以与C代码一样快的速度运行,但是两个数量级太多,必须有更好的方法来进行比特混乱.

更新:应用答案中建议的优化的结果

带有输入28的参考C++程序,使用LLVM 8.0.0编译,在我的机器上以0.67秒运行(与clang 3.7相同,速度稍慢,为0.68秒)
基线Haskell代码运行速度大约慢100倍(因为空间效率低,不会尝试使用大于25的输入)
随着@Thomas M. DuBuisson的重写,仍然使用默认的GHC后端,执行时间降至5.2秒
随着@Thomas M. DuBuisson的重写,现在使用LLVM后端(GHC选项-O2 -fllvm),执行时间降至1.7s
- 使用GHC选项-O2 -fllvm -optlc -mcpu=native使其达到0.73秒
当使用Thomas的代码(使用默认的"本机"后端和LLVM)时,iterate用iterate'@cirdec 替换没有任何区别.但是,使用基线代码时确实有所不同.

所以,我们来自100x到8x到1.09x,即比C慢9%!

注意 LLVM后端到GHC 8.0.2需要LLVM 3.7.在Mac OS X上,这意味着安装此版本,brew然后进行符号链接opt和llc.见7.10.GHC后端.

Answer 1

Tho*_*son 8

前面的事情

首先,我在Intel I5~2.5GHz,linux x86-64上使用GHC 8.0.1.

初稿:哦不!慢了!

您的参数25的起始代码运行:

% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main             ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25  7.25s user 0.50s system 99% cpu 7.748 total

Run Code Online (Sandbox Code Playgroud)

所以节拍的时间是77ms - 比这个Haskell代码好两个数量级.让我们潜入.

问题1:Shifty Code

我发现代码有些奇怪.首先是使用shift高性能代码.Shift支持左右移位,这样做需要分支.让我们用更可读的两个权限(shift 1 x〜> 2^x和shift x 1〜> 2*x)来杀死它:

% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main             ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total

Run Code Online (Sandbox Code Playgroud)

(正如你在评论中指出的那样:是的,这需要调查.可能是先前代码的一些奇怪之处在于阻止重写规则被触发,结果导致更糟糕的代码)

问题2:比特列表？Int操作拯救了一天!

一个变化,一个数量级.好极了.还有什么？好吧,你有这个尴尬的位置列表,你正在点击它似乎是在寻求效率低下和/或倾向于脆弱的优化.此时我会注意到,对该列表中的任何一个选项进行硬编码会产生非常好的性能(例如testBit lsfr 24 `xor` testBit lsfr 21),但我们需要更通用的快速解决方案.

我建议我们计算所有分接位置的掩码,然后执行单指令弹出计数.要做到这一点,我们只需要Int传入advance一个而不是整个列表.popcount指令需要良好的程序集生成,这需要llvm和可能-optlc-mcpu=native或另一个非悲观的指令集选择.

这一步给我们pc以下.advance在评论中提到的那个我已经弃去了警卫 -

let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1
    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr 
    out :: Int
    out = last $ take (2^len) $ iterate advance' 0

Run Code Online (Sandbox Code Playgroud)

我们的表现是:

% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25      
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
OK
./so 25  0.06s user 0.00s system 96% cpu 0.067 total

Run Code Online (Sandbox Code Playgroud)

从开始到结束都超过两个数量级,所以希望它与你的C.匹配.最后,在部署的代码中,实际上通常使用带有C绑定的Haskell包,但这通常是一种教育练习,所以我希望你玩得开心.

编辑:现在可用的C++代码占用我的系统0.10(g++ -O3)和0.12(clang++ -O3 -march=native)秒,所以看起来我们已经击败了我们的标记.

Answer 2

Cir*_*dec 6

我怀疑在评估它之前,以下行正在内存中构建一个类似列表的大型thunk.

let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0` is

Run Code Online (Sandbox Code Playgroud)

让我们看看我是否正确,如果我是,我们将解决它.第一个调试步骤是了解程序使用的内存.要做到这一点,我们将使用-rtsopts除了之外的选项进行编译-O2.这使得能够使用RTS选项运行程序,+RTS -s其中包含输出小内存摘要的内容.

初始表现

运行程序时,lfsr 25 +RTS -s我得到以下输出

OK
   5,420,148,768 bytes allocated in the heap
   6,705,977,216 bytes copied during GC
   1,567,511,384 bytes maximum residency (20 sample(s))
     357,862,432 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.453s   2.522s     0.0002s    0.0009s
  Gen  1        20 colls,     0 par    2.281s   3.065s     0.1533s    0.7128s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.438s  (  1.162s elapsed)
  GC      time    4.734s  (  5.587s elapsed)
  EXIT    time    0.016s  (  0.218s elapsed)
  Total   time    6.188s  (  6.967s elapsed)

  %GC     time      76.5%  (80.2% elapsed)

  Alloc rate    3,770,538,273 bytes per MUT second

  Productivity  23.5% of total user, 19.8% of total elapsed

Run Code Online (Sandbox Code Playgroud)

这是一次使用的大量内存.很可能在那里有一个巨大的笨蛋.

试图减少thunk尺寸

我假设thunk正在建造中iterate (advance ...).如果是这种情况,我们可以尝试通过advance在其lsfr参数中设置更严格来减少thunk大小.这不会删除thunk的脊柱(连续迭代),但它可能会减少脊柱评估时构建的状态的大小.

BangPatterns是一种在参数中使函数严格的简单方法.f !x = ..是简写f x = seq x $ ...

{-# LANGUAGE BangPatterns #-}

advance :: Int -> [Int] -> Int -> Int
advance len tap = go
  where
    go !lfsr
      | d0        = shifted
      | otherwise = shifted .|. 1
      where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap

Run Code Online (Sandbox Code Playgroud)

让我们看看这有什么不同......

>lfsr 25 +RTS -s
OK
   5,420,149,072 bytes allocated in the heap
   6,705,979,368 bytes copied during GC
   1,567,511,448 bytes maximum residency (20 sample(s))
     357,862,448 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.688s   2.711s     0.0003s    0.0059s
  Gen  1        20 colls,     0 par    2.438s   3.252s     0.1626s    0.8013s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.328s  (  1.146s elapsed)
  GC      time    5.125s  (  5.963s elapsed)
  EXIT    time    0.000s  (  0.226s elapsed)
  Total   time    6.484s  (  7.335s elapsed)

  %GC     time      79.0%  (81.3% elapsed)

  Alloc rate    4,081,053,418 bytes per MUT second

  Productivity  21.0% of total user, 18.7% of total elapsed

Run Code Online (Sandbox Code Playgroud)

没有那么明显.

消除脊柱

我想这iterate (advance ...)是正在构建的内容.毕竟,对于我正在运行的命令,列表将是2^25,或者长达3300多万个项目.列表融合可能正在删除列表本身,但列表最后一项的thunk超过3300万个应用程序advance ...

为了解决这个问题,我们需要一个严格的版本,iterate以便Int在advance再次应用函数之前将值强制为a .这应该使内存一次只保留一个lfsr值,以及当前计算的应用程序advance.

不幸的是,没有严格iterate的Data.List.这是一个不放弃列表融合的人,它为这个问题提供了其他重要的(我认为)性能优化.

{-# LANGUAGE BangPatterns #-}

import GHC.Base (build)

{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
  where go !x = x : go (f x)

{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
  where go !x = x `c` go (f x)

{-# RULES
"iterate'"    [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'"  [1]              iterateFB' (:) = iterate'
 #-}

Run Code Online (Sandbox Code Playgroud)

这只是iterate来自GHC.List(及其所有重写规则),但在累积的参数中是严格的.

配备严格的迭代,iterate'我们可以改变麻烦的线路

let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0

Run Code Online (Sandbox Code Playgroud)

我希望这会表现得更好.让我们来看看 ...

>lfsr 25 +RTS -s
OK
   3,758,156,184 bytes allocated in the heap
         297,976 bytes copied during GC
          43,800 bytes maximum residency (1 sample(s))
          21,736 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7281 colls,     0 par    0.047s   0.008s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.750s  (  0.783s elapsed)
  GC      time    0.047s  (  0.008s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.797s  (  0.792s elapsed)

  %GC     time       5.9%  (1.0% elapsed)

  Alloc rate    5,010,874,912 bytes per MUT second

  Productivity  94.1% of total user, 99.0% of total elapsed

Run Code Online (Sandbox Code Playgroud)

这使用0.00002了很多内存,运行速度提高了10倍.

我不知道这是否会改善Thomas DeBuisson的答案,advance但答案有所改善,但仍然留下了懒惰iterate advance'.这很容易检查; 将iterate'代码添加到该答案并iterate'代替iterate该答案使用.

归档时间：	8 年，6 月前
查看次数：	284 次
最近记录：	7 年，2 月前