在Haskell中从列表中无需替换的随机样本的更好方法

aje*_*eck 8 haskell list

我需要从较长的列表中随机取样而不进行替换(每个元素仅在样本中出现一次).我正在使用下面的代码,但现在我想知道:

  1. 是否有库函数可以做到这一点?
  2. 我该如何改进这段代码?(我是一个Haskell初学者,所以即使有一个库函数,这也很有用).

抽样的目的是能够概括从样本分析到人群的结果.

import System.Random

-- | Take a random sample without replacement of size size from a list.
takeRandomSample :: Int -> Int -> [a] -> [a]
takeRandomSample seed size xs
    | size < hi  = subset xs rs
    | otherwise = error "Sample size must be smaller than population."
    where
        rs = randomSample seed size lo hi
        lo = 0
        hi = length xs - 1

getOneRandomV g lo hi = randomR (lo, hi) g

rsHelper size lo hi g x acc
    | x `notElem` acc && length acc < size = rsHelper size lo hi new_g new_x (x:acc)
    | x `elem` acc && length acc < size = rsHelper size lo hi new_g new_x acc
    | otherwise = acc
    where (new_x, new_g) = getOneRandomV g lo hi

-- | Get a random sample without replacement of size size between lo and hi.
randomSample seed size lo hi = rsHelper size lo hi g x [] where
(x, g)  = getOneRandomV (mkStdGen seed) lo hi

subset l = map (l !!) 
Run Code Online (Sandbox Code Playgroud)

jto*_*bin 6

这是Daniel Fischer在评论中建议的快速"背后"实现,使用我首选的PRNG(mwc-random):

{-# LANGUAGE BangPatterns #-}

module Sample (sample) where

import Control.Monad.Primitive
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import System.Random.MWC

sample :: PrimMonad m => [a] -> Int -> Gen (PrimState m) -> m [a]
sample ys size = go 0 (l - 1) (Seq.fromList ys) where
    l = length ys
    go !n !i xs g | n >= size = return $! (toList . Seq.drop (l - size)) xs
                  | otherwise = do
                      j <- uniformR (0, i) g
                      let toI  = xs `Seq.index` j
                          toJ  = xs `Seq.index` i
                          next = (Seq.update i toI . Seq.update j toJ) xs
                      go (n + 1) (i - 1) next g
{-# INLINE sample #-}
Run Code Online (Sandbox Code Playgroud)

这几乎是对R的内部C版本的(简洁)功能重写,sample()因为它被称为无需替换.

sample它只是一个递归工作函数的包装器,它逐渐改变填充,直到达到所需的样本大小,只返回那么多的混洗元素.编写这样的函数可以确保GHC可以内联它.

它易于使用:

*Main> create >>= sample [1..100] 10
[51,94,58,3,91,70,19,65,24,53]
Run Code Online (Sandbox Code Playgroud)

生产版本可能希望使用类似可变向量的东西而不是Data.Sequence为了减少执行GC所花费的时间.