Mat*_*hid 2 haskell bytestring
给定(严格)ByteString,计算它包含的每个可能字节的数量的最有效方法是什么?
我认为这sort应该被实现为计数排序 - 但似乎没有办法访问相关的计数.我还看到有一个count函数,它计算给定字节出现的次数.这给了我以下选项:
map (\ b -> count b str) [0x00 .. 0xFF]map length . group . sortfold*和IntMap字节频率.哪个可能会给我最好的表现?
dflemstr的基本思想当然是正确的,但由于你想要最好的性能,你需要使用unchecked访问ByteString以及计数数组,如
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unboxed
import Data.Word
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe
histogram :: ByteString -> UArray Word8 Int
histogram bs = runSTUArray $ do
hist <- newArray (0, 255) 0
let l = BS.length bs
update b = do
o <- unsafeRead hist b
unsafeWrite hist b (o+1)
loop i
| i < 0 = return hist
| otherwise = do
update $ fromIntegral (bs `unsafeIndex` i)
loop (i-1)
loop (l-1)
Run Code Online (Sandbox Code Playgroud)
根据criterion(构建一个200000长的直方图),这会产生相当大的差异ByteString:
warming up
estimating clock resolution...
mean is 1.667687 us (320001 iterations)
found 3078 outliers among 319999 samples (1.0%)
1947 (0.6%) high severe
estimating cost of a clock call...
mean is 40.43765 ns (14 iterations)
benchmarking dflemstr
mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
variance introduced by outliers: 74.820%
variance is severely inflated by outliers
benchmarking unsafeIndex
mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
variance introduced by outliers: 88.342%
variance is severely inflated by outliers
Run Code Online (Sandbox Code Playgroud)
(我改变了dflemstr的代码也使用runSTUArray并返回a UArray Word8 Int来获得uiform返回值,但这对运行时间没有太大影响.)