ByteString直方图

Mat*_*hid 2 haskell bytestring

给定(严格)ByteString,计算它包含的每个可能字节的数量的最有效方法是什么?

我认为这sort应该被实现为计数排序 - 但似乎没有办法访问相关的计数.我还看到有一个count函数,它计算给定字节出现的次数.这给了我以下选项:

  • map (\ b -> count b str) [0x00 .. 0xFF]
  • map length . group . sort
  • 一些与fold*IntMap字节频率.

哪个可能会给我最好的表现?

Dan*_*her 5

dflemstr的基本思想当然是正确的,但由于你想要最好的性能,你需要使用unchecked访问ByteString以及计数数组,如

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unboxed

import Data.Word

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe

histogram :: ByteString -> UArray Word8 Int
histogram bs = runSTUArray $ do
    hist <- newArray (0, 255) 0
    let l = BS.length bs
        update b = do
            o <- unsafeRead hist b
            unsafeWrite hist b (o+1)
        loop i
            | i < 0     = return hist
            | otherwise = do
                update $ fromIntegral (bs `unsafeIndex` i)
                loop (i-1)
    loop (l-1)
Run Code Online (Sandbox Code Playgroud)

根据criterion(构建一个200000长的直方图),这会产生相当大的差异ByteString:

warming up
estimating clock resolution...
mean is 1.667687 us (320001 iterations)
found 3078 outliers among 319999 samples (1.0%)
  1947 (0.6%) high severe
estimating cost of a clock call...
mean is 40.43765 ns (14 iterations)

benchmarking dflemstr
mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
variance introduced by outliers: 74.820%
variance is severely inflated by outliers

benchmarking unsafeIndex
mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
variance introduced by outliers: 88.342%
variance is severely inflated by outliers
Run Code Online (Sandbox Code Playgroud)

(我改变了dflemstr的代码也使用runSTUArray并返回a UArray Word8 Int来获得uiform返回值,但这对运行时间没有太大影响.)

  • @josejuan在这种情况下,它只是一个不幸的名称选择,它应该是`unsafeIfYouDidn'tCheckTheValidityOfIndexButPerfectlyFineIfYouDidRead/Write/Index`,但这太长了.这与"unsafePerformIO"完全不同,是"不安全". (2认同)