Random sampling of a character vector, without elements prefixing one another

jba*_*ums 44 performance r combinatorics

Consider a character vector, pool, whose elements are (zero-padded) binary numbers with up to max_len digits.

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))

pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110" 
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"

I'd like to sample n of these elements, with the constraint that none of the sampled elements may be a prefix of any other sampled element (i.e. if we sample 1101, we forbid 1, 11, and 110, while if we sample 1, we forbid those elements beginning with 1, such as 10, 11, 100, etc.).

The following is my attempt using a while loop, but of course this gets slow when n is large (or approaches 2^max_len).

set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}

chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111"  "0000" "0110" "1100" "0111"

This can be improved slightly by initially removing from pool those elements whose inclusion would mean there are insufficient remaining elements of pool to take a total sample of size n. For example, when max_len = 4 and n > 9, we can immediately remove 0 and 1 from pool, since by including either, the maximum possible sample would be 9 (either 0 and the eight 4-character elements beginning with 1, or 1 and the eight 4-character elements beginning with 0).

Based on this logic, we could then omit the following elements from pool before taking the initial sample:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]
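
As a quick sanity check of that bound: including any string of length k caps the achievable sample size at 2^max_len - 2^(max_len - k) + 1, which is exactly the vector that n is compared against above.

max_len <- 4
2^max_len - rev(2^(0:max_len))[-1] + 1
## [1]  9 13 15 16    # caps when including an element of length 1, 2, 3 or 4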

Can anyone think of a better approach? I feel like I'm overlooking something much simpler.


Edit

To clarify my intentions, I'll picture the pool as a set of branches, where the junctions and tips are the nodes (the elements of pool). Assume the yellow node in the figure below (i.e. 010) was drawn. Now, the entire red "branch", consisting of the nodes 0, 01 and 010, is removed from the pool. This is what I meant by forbidding the sampling of nodes that "prefix" nodes already in our sample (as well as nodes that are already prefixed by nodes in our sample).

[figure: tree in which node 010 (yellow) has been sampled and the red branch 0 → 01 → 010 is removed]

If the sampled node sits partway along a branch, such as 01 in the figure below, then all of the red nodes (0, 01, 010 and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.

[figure: tree in which node 01 (yellow) has been sampled and the red nodes 0, 01, 010 and 011 are disallowed]

I'm not trying to sample either 1 or 0 at each junction (i.e. walk along the branches flipping a coin at each fork) - it's fine to have both in the sample, as long as: (1) the node's parents (or grandparents, etc.) and children (grandchildren, etc.) have not already been sampled; and (2) when the node is sampled, sufficient nodes will remain to achieve the desired sample of size n.

In the second figure above, if 01 was our first pick, then all of the black nodes are still (currently) valid, provided n <= 4. If, for example, n==4 and we next sampled node 1 (so our picks now comprise 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If, on the other hand, n==5, node 1 would be disallowed at this stage.
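
For concreteness, the prefix rule can be checked directly with base R's startsWith (a small illustrative helper, not part of the attempt above):

conflicts <- function(candidate, chosen)
  any(startsWith(candidate, chosen) | startsWith(chosen, candidate))

conflicts("010", "01")          # TRUE: "01" prefixes "010", so "010" is disallowed
conflicts("000", c("01", "1"))  # FALSE: "000" is compatible with both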

Bro*_*ieG 19

Introduction

This is a numeric variant of the string algorithm we implemented in another answer. It is faster and does not require either creating or sorting the pool.

Algorithm Outline

We can use integers to represent the binary strings, which greatly simplifies the problem of generating the pool and of eliminating values sequentially. For example, with max_len==3, we can represent the number 1-- (where - denotes padding) as 4 in decimal. Further, we can determine that the numbers that need to be eliminated if we pick this number are those between 4 and 4 + 2^x - 1, where x is the number of padded elements (2 in this case), so the numbers to eliminate are those between 4 and 4 + 2^2 - 1 (i.e. between 4 and 7, representing 100, 101, 110, and 111).
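
As a small illustration of that range (hand-picked values, not code from the implementation):

val <- 4L                 # "1--" in a max_len == 3 world
x   <- 2L                 # number of padded positions
seq(val, val + 2^x - 1)   # 4 5 6 7, i.e. 100, 101, 110 and 111 get eliminated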

To match your problem exactly we need a small wrinkle, because you treat numbers that may be the same in binary as distinct in some parts of your algorithm. For example, 100, 10-, and 1-- are all the same number, but need to be treated differently in your scheme. In a max_len==3 world we have 8 possible numbers, but 14 possible representations:

0 - 000: 0--, 00-
1 - 001:
2 - 010: 01-
3 - 011:
4 - 100: 1--, 10-
5 - 101:
6 - 110: 11-
7 - 111:

So 0 and 4 have three possible encodings, 2 and 6 have two, and all the others just one. We need to generate an integer pool that reflects the higher selection probability of the numbers with multiple representations, as well as a mechanism for tracking how many blanks each number includes. We can do this by appending a few bits to the end of our numbers to indicate the weights we want. So our numbers become (we use two bits here):

jbaum | int | bin | bin.enc | int.enc    
  0-- |   0 | 000 |   00000 |       0
  00- |   0 | 000 |   00001 |       1      
  000 |   0 | 000 |   00010 |       2      
  001 |   1 | 001 |   00100 |       3      
  01- |   2 | 010 |   01000 |       4  
  010 |   2 | 010 |   01001 |       5  
  011 |   3 | 011 |   01101 |       6  
  1-- |   4 | 100 |   10000 |       7  
  10- |   4 | 100 |   10001 |       8  
  100 |   4 | 100 |   10010 |       9  
  101 |   5 | 101 |   10100 |      10  
  11- |   6 | 110 |   11000 |      11   
  110 |   6 | 110 |   11001 |      12   
  111 |   7 | 111 |   11100 |      13

Some useful properties:

  • enc.bits represents how many bits the encoding requires (two in this case)
  • int.enc %% enc.bits tells us how many of the digits are explicitly specified
  • int.enc %/% enc.bits returns int
  • int * 2 ^ enc.bits + explicitly.specified returns int.enc

Note that explicitly.specified here is between 0 and max_len - 1 in our implementation because there is always at least one digit specified. We now have an encoding that fully represents the data structure using only integers. We can sample from the integers and reproduce your desired outcome with the correct weights, etc. One limitation of this approach is that we use 32-bit integers in R, and we must reserve some bits for the encoding, so we limit ourselves to max_len==25 or so. You could go larger if you used the integers specified by double-precision floating point numbers, but we did not do that here.

Avoiding Duplicate Selections

There are two rough ways of ensuring that we do not pick the same value twice:

  1. Keep track of the values that remain available for selection, and sample randomly from among those
  2. Sample randomly from among all possible values, then check whether the value has been selected previously, and if it has, sample again

While the first option seems the cleanest, it is actually very expensive computationally. It requires either a vector scan on every pick to pre-disqualify the selected values, or creating a shrinking vector containing only the non-disqualified values. The shrinking option is only more efficient than the vector scan if the vector is shrunk by reference via C code, but even then it requires repeatedly shifting potentially large portions of the vector, and it requires C.

Here we use approach #2. This allows us to randomly shuffle the universe of possible values once, and then pick each value in sequence, check whether it has been disqualified, and if it has, pick another one, and so on. This works because checking whether a value has been picked is trivial thanks to our value encoding: we can infer the location of a value in a sorted table from the value alone. So we record the status of each value in a sorted table, and we can update or look up that status via direct index access (no scanning required).
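
A bare-bones sketch of this idea (illustrative only - the gist implementation also disqualifies the parents and children of each pick, which is omitted here):

vals   <- sample(0:13)        # shuffled encoded values for a max_len == 3 pool
status <- rep(TRUE, 14)       # TRUE = still eligible, indexed directly by value + 1
picked <- integer(0)
for (v in vals) {
  if (!status[v + 1L]) next   # O(1) status lookup, no scan
  picked <- c(picked, v)
  status[v + 1L] <- FALSE     # a full version would also flag v's parents/children here
  if (length(picked) == 4L) break
}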

Examples

An implementation of this algorithm in base R is available as a gist. This particular implementation can only pull complete draws. Here are some sample draws of 8 elements each from a max_len==4 pool:

# each column represents a draw from a `max_len==4` pool

set.seed(6); replicate(10, sample0110b(4, 8))
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10] 
[1,] "1000" "1"    "0011" "0010" "100"  "0011" "0"    "011"  "0100" "1011"
[2,] "111"  "0000" "1101" "0000" "0110" "0100" "1000" "00"   "0101" "1001"
[3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10"   "1100"
[4,] "0100" "0010" "0000" "0101" "1101" "101"  "1011" "1101" "0110" "1101"
[5,] "101"  "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111"
[6,] "110"  "0111" "1011" "111"  "1011" "110"  "1111" "0100" "0011" "000" 
[7,] "0101" "0101" "111"  "011"  "1010" "1000" "1100" "101"  "0001" "0101"
[8,] "011"  "0001" "01"   "1010" "0011" "1110" "1110" "1001" "110"  "1000"

We originally had two other implementations that relied on approach #1 for avoiding duplicates, one in base R and one in C, but even the C version is not as fast as the new base R version when n gets large. Those functions do have the ability to make incomplete draws, so we keep them around for reference.

Comparative Benchmarks

Here is a set of benchmarks comparing several of the functions that appear in this Q/A. Times are in milliseconds. The brodie.b version is the one described in this answer; brodie is the original implementation, and brodie.C is the original with some C. All of these enforce the requirement of a complete sample. brodie.str is the string-based version from the other answer.

   size    n  jbaum josilber  frank tensibai brodie.b brodie brodie.C brodie.str
1     4   10     11        1      3        1        1      1        1          0
2     4   50      -        -      -        1        -      -        -          1
3     4  100      -        -      -        1        -      -        -          0
4     4  256      -        -      -        1        -      -        -          1
5     4 1000      -        -      -        1        -      -        -          1
6     8   10      1      290      6        3        2      2        1          1
7     8   50    388        -      8        8        3      4        3          4
8     8  100  2,506        -     13       18        6      7        5          5
9     8  256      -        -     22       27       13     14       12          6
10    8 1000      -        -      -       27        -      -        -          7
11   16   10      -        -    615      688       31     61       19        424
12   16   50      -        -  2,123    2,497       28    276       19      1,764
13   16  100      -        -  4,202    4,807       30    451       23      3,166
14   16  256      -        - 11,822   11,942       40  1,077       43      8,717
15   16 1000      -        - 38,132   44,591       83  3,345      130     27,768

This holds up reasonably well for larger pools too:

system.time(sample0110b(18, 100000))
   user  system elapsed 
  8.441   0.079   8.527 

Benchmark notes:

  • frank, and brodie (minus brodie.str) do not require any pre-generated pool, which affects the comparisons (see below)
  • josilber is the LP version
  • jbaum is the OP's example
  • tensibai is modified slightly to exit rather than fail if the pool is empty
  • I am not set up to run python, so I could not compare it or account for its buffering
  • "-" represents options that were either infeasible or too slow to time reasonably

Timings do not include pulling the pools (0.8, 2.5, and 401 milliseconds respectively for sizes 4, 8, and 16), which is required for the jbaum, josilber, and brodie.str runs, or sorting them (0.1, 2.7, and 3,700 milliseconds for sizes 4, 8, and 16), which is required for brodie.str in addition to the pull. Whether you want to include these depends on how many times you run the function for a particular pool. Also, there are almost certainly better ways to generate / sort the pool.

These are the median times of three runs of microbenchmark. The code is available as a gist, though note that you will have to load the sample0110b, sample0110, sample01101, and sample01 functions beforehand.


Ten*_*bai 15

I found the question interesting, so I tried to work it out with my (rather low) skill level in R (so this can probably be improved):

Updated edit version, thanks to @Franck's suggestions:

library(microbenchmark)
library(lineprof)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n<-100

library(stringr)
tree_sample <- function(samples,pool) {
  results <- vector("integer",samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2)

  # Make an integer vector from the "0" right-padded original: for max_len=4 and pool entry "1" we get "1000" => 8
  # This allows finding this entry as the parent of "10" and "11", which become "1000" and "1100" (integers 8 and 12 respectively):
  # once bitwise-ANDed with the respective mask "1000" the first bit is strictly the same, so it's a parent.
  integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2)

  # Create a vector to filter the available value to sample
  ok <- rep(TRUE,length(pool))

  # Precompute the result of the bitwise AND between our integer pool and the masks
  MaskedPool <- bitwAnd(integerPool,masks)

  while(samples) {
    samp <- sample(pool[ok],1) # Get a sample
    results[samples] <- samp # Store it as result
    ok[pool == samp] <- FALSE # Remove it from available entries

    vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right padded sample
    mlen <- str_length(samp) # Get sample len

    # Create a unitary mask to remove the children of the sample
    mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2)

    # Get the result of bitwise And between the integerPool and the sample mask 
    FilterVec <- bitwAnd(integerPool,mask)

    # Get the bitwise AND result of the sample and its mask
    Childm <- bitwAnd(vsamp,mask)

    ok[FilterVec == Childm] <- FALSE  # Remove from available entries the children of the sample
    ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # compare the sample with all the masks to remove parents matching

    samples <- samples -1
  }
  print(results)
}
microbenchmark(tree_sample(n,pool),times=10L)

The main idea is to use bitmask comparisons to know whether one sample is a parent of another (i.e. shares a common bit prefix), and if so, to suppress that element from the pool.
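
To make the mask comparison concrete (an illustrative check with hand-picked values for max_len = 4):

entry <- strtoi("0100", base = 2)  # pool entry "01", right-padded to "0100"
emask <- strtoi("1100", base = 2)  # mask covering the two specified digits of "01"
samp  <- strtoi("0100", base = 2)  # candidate sample "010", right-padded to "0100"
bitwAnd(entry, emask) == bitwAnd(samp, emask)  # TRUE, so "01" is a parent of "010"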

It now takes about 1.4 seconds to draw 100 samples from a pool of strings of up to 16 characters on my machine.


Bro*_*ieG 13

You could sort the pool to help decide which elements to disqualify. For example, looking at a sorted pool with elements up to three characters long:

 [1] "0"   "00"  "000" "001" "01"  "010" "011" "1"   "10"  "100" "101" "11" 
[13] "110" "111"

I can tell that I can disqualify anything following the selected item that has more characters than my item, up until the first item with the same number of characters or fewer. For example, if I select "01", I can see right away that the next two items ("010", "011") need to be removed, but not the one after that because "1" has fewer characters. Removing "0" afterwards is easy.
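
A small check of that rule on the sorted pool shown above (the variable names here are illustrative only):

pool3 <- sort(unlist(lapply(1:3, function(x)
  do.call(paste0, expand.grid(rep(list(c("0", "1")), x))))))
pick <- which(pool3 == "01")                                          # position 5
first_short <- which(nchar(pool3) <= nchar("01") & seq_along(pool3) > pick)[1]
pool3[(pick + 1):(first_short - 1)]
# [1] "010" "011"

Here is an implementation: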

library(fastmatch)  # could use `match`, but we repeatedly search against same hash

# `pool` must be sorted!

sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted

    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool

    seq.left <- seq.int(left)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)

    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a 
    # subset in `pool.left`

    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications

    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches

    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE  
  }
  pick.list  
}

And a function to make a sorted pool (exactly like your code, but it returns the pool sorted):

make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size), 
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) 
  ) ) )

Then, using a max_len 3 pool (useful for visually inspecting that things behave as expected):

pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1"   "010" "011" "000" ""    ""    ""   
sample01(pool3, 8)
# [1] "110" "111" "011" "10"  "00"  ""    ""    ""   
sample01(pool3, 8)
# [1] "000" "01"  "11"  "10"  "001" ""    ""    ""   
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"    

Notice how in the last case we get all the 3-digit binary combinations (2^3) because by chance we kept sampling from the 3-digit ones. Also, with a max_len 3 pool there are many samplings that prevent a full 8 draws; you could address that with your suggestion of eliminating the combinations that prevent complete draws from the pool.

This is quite fast. Look at the max_len==9 example that took two seconds with the alternate solution:

pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
#                expr     min      lq  median      uq     max neval
#  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100    

About half a millisecond. You can also reasonably attempt fairly large pools:

pool16 <- make_pool(16)  # 131K entries
system.time(sample01(pool16, 100))
#  user  system elapsed 
# 3.407   0.146   3.552 

That is not super fast, but then again we are talking about a pool with 130K items. There is also potential for additional optimization.

Note that the sorting step becomes relatively slow for large pools, but I am not counting it because you only ever need to do it once, and you could probably come up with a reasonable algorithm that produces the pool pre-sorted.

There is also the possibility of a much faster integer-to-binary based approach that I explored in a now-deleted answer, but that requires a fair bit more work to match exactly what you are looking for.


Fra*_*ank 13

Mapping ids to strings. You can map numbers to 0/1 vectors, as @BrodieG mentioned:

# some key objects

n_pool      = sum(2^(1:max_len))      # total number of indices
cuts        = cumsum(2^(1:max_len-1)) # new group starts
inds_by_g   = mapply(seq,cuts,cuts*2) # indices grouped by length

# the mapping to strings (one among many possibilities)

library(data.table)
get_01str <- function(id,max_len){
    cuts = cumsum(2^(1:max_len-1))
    g    = findInterval(id,cuts)
    gid  = id-cuts[g]+1

    data.table(g,gid)[,s:=
      do.call(paste,c(list(sep=""),lapply(
        seq(g[1]), 
        function(x) (gid-1) %/% 2^(x-1) %% 2
      )))
    ,by=g]$s      
} 

Finding ids to drop. We will sequentially remove ids from the sampling pool:

 # the mapping from one index to indices of nixed strings

get_nixstrs <- function(g,gid,max_len){

    cuts         = cumsum(2^(1:max_len-1))
    gids_child   = {
      x = gid%%2^sequence(g-1)
      ifelse(x,x,2^sequence(g-1))
    }
    ids_child    = gids_child+cuts[sequence(g-1)]-1

    ids_parent   = if (g==max_len) gid+cuts[g]-1 else {

      gids_par       = vector(mode="list",max_len)
      gids_par[[g]]  = gid
      for (gg in seq(g,max_len-1)) 
        gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg)

      unlist(mapply(`+`,gids_par,cuts-1))
    }

    c(ids_child,ids_parent)
}

The indices are grouped by g, the number of characters, nchar(get_01str(id)). Because the indices are sorted by g, g = findInterval(id, cuts) is a quicker route.

An index in group g, 1 < g < max_len, has one "child" index of size g-1 and two parent indices of size g+1. For each child we take its child until we hit g==1; and for each parent we take their pair of parents until we hit g==max_len.

The structure of the tree is simplest in terms of the within-group identifier, gid. gid maps to the two parents, gid and gid+2^g; and reversing this mapping finds the child.
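
A quick check of these mappings using the helpers above (ids worked out by hand for max_len = 3, where cuts = c(1, 3, 7)):

get_01str(5, 3)          # "01": g = findInterval(5, c(1, 3, 7)) = 2, gid = 3
get_01str(c(9, 13), 3)   # "010" "011": the gids 3 and 3 + 2^2 = 7 within group 3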

Sampling

drawem <- function(n,max_len){
    cuts        = cumsum(2^(1:max_len-1))
    inds_by_g   = mapply(seq,cuts,cuts*2)

    oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
    okinds = unlist(inds_by_g[oklens])

    mysamp = rep(0,n)
    for (i in 1:n){

        id        = if (length(okinds)==1) okinds else sample(okinds,1)
        g         = findInterval(id,cuts)
        gid       = id-cuts[g]+1
        nixed     = get_nixstrs(g,gid,max_len)

        # print(id); print(okinds); print(nixed)

        mysamp[i] = id
        okinds    = setdiff(okinds,nixed)
        if (!length(okinds)) break
    }

    res <- rep("",n)
    res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len)
    res
}

The oklens part integrates the OP's idea of omitting strings that are guaranteed to make the sampling impossible. However, even doing this, we may follow a sampling path that leaves us with no more options. Taking the OP's example of max_len=4 and n=10, we know we must drop 0 and 1 from consideration, but what happens if our first four draws turn out to be 00, 01, 11 and 10? Oh well, I guess we are out of luck. This is why you should actually define sampling probabilities. (The OP has another idea, of determining at each step which nodes will lead to an impossible state, but that seems a tall order.)

Illustration

# how the indices line up

n_pool = sum(2^(1:max_len)) 
pdt <- data.table(id=1:n_pool)
pdt[,g:=findInterval(id,cuts)]
pdt[,gid:=1:.N,by=g]
pdt[,s:=get_01str(id,max_len)]

# example run

set.seed(4); drawem(5,5)
# [1] "01100" "1"     "0001"  "0101"  "00101"

set.seed(4); drawem(8,4)
# [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""  

Benchmarks (older than the ones in @BrodieG's answer)

require(rbenchmark)
max_len = 8
n = 8

benchmark(
      jos_lp     = {
        pool <- unlist(lapply(seq_len(max_len),
          function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
        sample.lp(pool, n)},
      bro_string = {pool <- make_pool(max_len);sample01(pool,n)},
      fra_num    = drawem(n,max_len),
      replications=5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string={pool <- make_pool(max_len);sample01(pool,n)},
  fra_num=drawem(n,max_len),
  replications=5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08

Other answers. There are also two other answers:

jos_enum = {pool <- unlist(lapply(seq_len(max_len), 
    function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)}
bro_num  = sample011(max_len,n)    

I left out @josilber's enumeration method because it took far too long, and @BrodieG's numeric/index method because it did not work at the time, but it does now. See @BrodieG's updated answer for more benchmarking.

Speed vs. correctness. While @josilber's answers are far slower (and, for the enumeration method, apparently much more memory-intensive), they are guaranteed to draw a sample of size n on the first try. With @BrodieG's string method or this answer, you will have to resample again and again in the hope of drawing a full n. With large max_len, that should not be such a problem, I suppose.

This answer also scales better than bro_string because it does not require pre-building the pool.


swe*_*zel 13

It is in python rather than r, but jbaums said it would be fine.

So here is my contribution; see the comments in the source code for an explanation of the crucial parts.
I am still working on an analytical solution to determine the number of possible combinations c of a sample of size S from a tree of depth t, so that I can improve the function combs. Maybe someone has it? This really is the bottleneck now.

Sampling 100 nodes from a tree of depth 16 takes around 8 ms on my laptop. Not the first time, but the more you sample the faster it gets, up to a certain point, as combBuffer gets filled.

import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This trees parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node. Can be
                   omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'. Its
    purpose is to be able to sample n of its nodes where none of the sampled
    nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily evaluated.
    If the Tree gets too big anyway, you can call 'prune' on any node to delete
    its children.

        >>> t = Tree()
        >>> t.sample(8, toString=True, depth=3)
        ['111', '110', '101', '100', '011', '010', '001', '000']
        >>> Tree.maxdepth = 2
        >>> t.sample(4, toString=True)
        ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of the root
        node.
        """
        if cls.maxdepth == depth:
            return

        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, string of '0's and '1's equal to the path from the root
        to this subtree. Where '1' means going left and '0' means going right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions. Therefore
        # it is safe to do it this way. We could also save each node's
        # id once it is created to avoid recreating it every time, however
        # this won't save much time but use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The number of leaf nodes this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None

        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The amount of samples to be taken from this tree
        :type n: int
        :returns: The amount of possible combinations to choose n samples from
                  this tree

        Determines recursively the amount of combinations of n nodes to be
        sampled from this tree.
        Subsequent calls with same n on trees with same depth will return the
        result from the previous computation rather than computing it again.

            >>> t = Tree()
            >>> Tree.maxdepth = 4
            >>> t.combs(16)
            1
            >>> Tree.maxdepth = 3
            >>> t.combs(6)
            58
        """

        # important for the amount of combinations is only n and the depth of
        # this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function with
        # equal values on equal nodes just returns the already computed value if
        # possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 have to stay on
                    # either side, otherwise the other one would have to
                    # sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest n-2*nMin is the amount of samples that are free to
                # fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How may samples to take from this tree
        :type n: int
        :param toString: If 'True' the result will directly be turned into a list
                         of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the sample's ids is
        a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} ' +
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take any node
        # from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore
            # the probability that we keep the root node has to be
            # 1/(2*leaves-1) = P_root. Lets create a random number from the
            # interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root

                # A child tree's size is (leaves-1) and since it obeys the same
                # rule as above, the probability for each of its nodes to
                # 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability, therefore to
                # make sure that their probability is also P_root we multiply
                # them by 1/2*(1-P_root). The latter is already done, the
                # former will be achieved by the next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root

                # Lets see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1 therefore we are NOT going to
        # sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children which means that this node will be blocked by those samples.
        # The difficult part now is to prove that the sampling the way we do it
        # is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 have to stay on either
            # side, otherwise the other one would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest n-2*nMin is the amount of samples that are free to fall
        # on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute 5
        # samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples are 3.
        # There are 4 different ways to split the samples among the
        # children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The amount of unique sample combinations per child are
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The amount of total unique samples per possible split are
        #   7   ,   66  ,   66  ,    7
        # In case of the first and last split, all samples have a probability
        # of 1/7, this was already proven above.
        # Lets suppose we are good to go and the per sample probabilities for
        # the other two cases are (1/11, 1/6) and (1/6, 1/11), this way the
        # overall per sample probabilities for the splits would be:
        #  1/7  ,  1/66 , 1/66 , 1/7
        # If we used uniform random to determine the split, all splits would be
        # equally probable and therefore be multiplied with the same value (1/4)
        # But this would mean that NOT every sample is equally probable!
        # We need to know in advance how many sample combinations there will be
        # for a given split in order to find out the probability to choose it.
        # In fact, due to the restrictions, this becomes very nasty to
        # determine. So instead of solving it analytically, I do it numerically
        # with the method 'combs'. It gives me the amount of possible sample
        # combinations for a certain amount of samples and a given tree depth.
        # It will return 146 for this node and 7 for the outer and 66 for the
        # inner splits.
        # What we now do is, we take a number from [0, 146).
        # if it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.

        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft

            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happened, p did not sum up ' +
                       'to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2}  {:>5} {:>4}  {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()

Benchmark output:

  i  depth    n  time (ms)
  1      4   10  0.182511806488
  2      4   50  --   
  3      4  100  --   
  4      4  150  --   
  5      8   10  0.397620201111
  6      8   50  1.66054964066
  7      8  100  2.90236949921
  8      8  150  3.48146915436
  9     15   10  0.804011821747
 10     15   50  3.7428188324
 11     15  100  7.34910964966
 12     15  150  10.8230614662
 13     16   10  0.804491043091
 14     16   50  3.66818904877
 15     16  100  7.09567070007
 16     16  150  10.404779911
 17     17   10  0.865840911865
 18     17   50  3.9999294281
 19     17  100  7.70257949829
 20     17  150  11.3758206367
 21     18   10  0.915451049805
 22     18   50  4.22935962677
 23     18  100  8.22361946106
 24     18  150  12.2081303596

Ten samples of size 10 from a tree of depth 10:

['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']


jos*_*ber 11

If you do not want to generate the set of all possible tuples and then sample from it randomly (which, as you note, may be infeasible for large input sizes), another option is to draw a single sample with integer programming. Basically you could assign each element of pool a random value and then select the feasible tuple with the largest sum of values. This should give every tuple an equal probability of being selected, because the tuples are all the same size and their values were selected randomly. The constraints of the model would ensure that none of the disallowed pairs of elements are selected together and that the correct number of elements is selected.

Here is a solution with the lpSolve package:

library(lpSolve)
sample.lp <- function(pool, max_len) {
  pool <- sort(pool)
  pml <- max(nchar(pool))
  runs <- c(rev(cumsum(2^(seq(pml-1)))), 0)
  banned.from <- rep(seq(pool), runs[nchar(pool)])
  banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len))
  banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool))
  banned.constr[cbind(seq(banned.from), banned.from)] <- 1
  banned.constr[cbind(seq(banned.to), banned.to)] <- 1
  mod <- lp(direction="max",
            objective.in=runif(length(pool)),
            const.mat=rbind(banned.constr, rep(1, length(pool))),
            const.dir=c(rep("<=", length(banned.from)), "=="),
            const.rhs=c(rep(1, length(banned.from)), max_len),
            all.bin=TRUE)
  pool[which(mod$solution == 1)]
}
set.seed(144)
pool <- unlist(lapply(seq_len(4), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
sample.lp(pool, 4)
# [1] "0011" "010"  "1000" "1100"
sample.lp(pool, 8)
# [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110"

This seems to scale to fairly large pools. For example, here is the time to draw a sample of size 20 from a pool of size 510:

pool <- unlist(lapply(seq_len(8), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
length(pool)
# [1] 510
system.time(sample.lp(pool, 20))
#    user  system elapsed 
#   0.232   0.008   0.239 

If you really needed to solve very large problem sizes, you could move from the open-source solver behind lpSolve to a commercial solver such as gurobi or cplex (not free in general, but free for academic use).


jos*_*ber 9

One approach would be simply to generate all possible tuples of the appropriate size with an iterative approach:

  1. Build all tuples of size 1 (all elements of pool)
  2. Take the cross product with the elements of pool
  3. Remove any tuple that uses the same element of pool more than once
  4. Remove any exact duplicate of another tuple
  5. Remove any tuple containing a pair of elements that cannot be used together
  6. Rinse and repeat until you reach the desired tuple size

This is runnable for the sizes given (a pool of length 30, with max_len of 4):

get.template <- function(pool, max_len) {
  banned <- which(outer(paste0('^', pool), pool, Vectorize(grepl)), arr.ind=T)
  banned <- banned[banned[,1] != banned[,2],]
  banned <- paste(banned[,1], banned[,2])
  vals <- matrix(seq(length(pool)))
  for (k in 2:max_len) {
    vals <- cbind(vals[rep(1:nrow(vals), each=length(pool)),],
                  rep(1:length(pool), nrow(vals)))
    # Can't sample same value more than once
    vals <- vals[apply(vals, 1, function(x) length(unique(x)) == length(x)),]
    # Sort rows to ensure unique only
    vals <- t(apply(vals, 1, sort))
    vals <- unique(vals)
    # Can't have banned pair
    combos <- combn(ncol(vals), 2)
    for (k in seq(ncol(combos))) {
        c1 <- combos[1,k]
        c2 <- combos[2,k]
        vals <- vals[!paste(vals[,c1], vals[,c2]) %in% banned,]
    }
  }
  return(matrix(pool[vals], nrow=nrow(vals)))
}

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
system.time(template <- get.template(pool, 4))
#   user  system elapsed 
#  4.549   0.050   4.614 

Now you can sample from the rows of template as many times as you like (which will be very fast), and this will be equivalent to random sampling from the defined space.
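
For instance, assuming the template matrix built above, each draw is just a random row:

template[sample(nrow(template), 1), ]   # one random 4-element sample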


Jim*_*ood 9

Introduction

I found this question so interesting that I felt compelled to think it over, and ultimately to provide my own answer. Since the algorithm I arrived at does not immediately follow from the problem description, I will first explain how I arrived at this solution, and then provide a sample implementation in C++ (I have never written R).

Developing the Solution

At First Glance

Reading the problem description was initially confusing, but once I saw the pictures of the trees in the edit, I immediately understood the problem, and my intuition suggested that a binary tree was also a solution: build a tree (a collection of trees of size 1), and break the tree into a collection of smaller trees after eliminating branches and ancestors upon making a selection.

Although this initially seemed fine, the selection process and the maintenance of the collection would be a pain. Still, trees seemed like they should play a big part in any solution.

Revision 1

Don't break up the tree. Instead, have a boolean data payload at each node indicating whether it has been eliminated. This leaves just a single tree that maintains its form.

Notice, though, that this is not just any binary tree; it is actually a complete binary tree of depth max_len-1.

Revision 2

A complete binary tree can be represented very nicely as an array. The typical array representation uses a breadth-first search of the tree, and has the following properties:

Let x be the array index.
x = 0 is the root of the entire tree
left_child(x) = 2x + 1
right_child(x) = 2x + 2
parent(x) = floor((n-1)/2)

In the figure below, each node is labelled with its array index:

[figure: complete binary tree with breadth-first array indices]

As an array, this takes up less memory (no more pointers), the memory used is contiguous (good for caching), and it can live entirely on the stack rather than the heap (assuming your language gives you the choice). Of course some conditions apply here, in particular regarding the size of the array. I will come back to that later.

Just as in Revision 1, the data stored in the array would be boolean: true for available, false for unavailable. Since the root node is not actually a valid choice, index 0 should be initialized to false. The question of how to make a selection still remains:

As indicators get eliminated, it is trivial to keep a count of how many have been eliminated, and therefore how many remain. Choose a random number in that range, then walk the array until that many indicators set to true have been seen (counting the current index). The index arrived at is then the selection. Select until n indices have been selected, or until there is nothing left to select.

That is a complete algorithm, but there is room for improvement in the selection process, and there is also the practical issue of size that has not been addressed: the array size would be O(2^n). As n gets larger, first the caching benefit disappears, then the data starts getting paged to disk, and at some point it becomes impossible to store at all.

Revision 3

I decided to tackle the easier problem first: improving the selection process.

Scanning the array from left to right is wasteful. It may be more efficient to track the ranges that have been eliminated rather than to check continuously and hit several failures in a row. However, our tree representation is not ideal for that, since the nodes that get eliminated in each round are rarely contiguous in the array.

By rearranging how the array maps onto the tree, it is, however, possible to exploit this better. In particular, let us use a pre-order depth-first search rather than a breadth-first search. In order to do so, the tree needs to be of a fixed size, which is the case for this problem. It is also less obvious how the indices of child and parent nodes are connected mathematically.

[figure: complete binary tree with pre-order depth-first array indices]

By using this arrangement, every choice that is not a leaf is guaranteed to eliminate a contiguous range: its subtree.

Revision 4

By tracking the eliminated ranges, the true/false data is no longer needed, and therefore neither is the array or the tree at all. On every random draw, the eliminated ranges can be used to quickly find the node to select. All ancestors and the entire subtree get eliminated, and can be represented as ranges that can easily be merged with the others.

The final task is to convert the selected nodes into the string representation the OP wanted. That is quite easy, since this binary tree still maintains a strict ordering: traversing from the root, all elements >= the right child are to the right, and the others are to the left. Searching the tree therefore yields both the list of ancestors and the binary string, by appending a '0' when moving left, or a '1' when moving right.

Sample Implementation

#include <stdint.h>
#include <algorithm>
#include <cmath>
#include <list>
#include <deque>
#include <ctime>
#include <cstdlib>
#include <iostream>

/*
 * A range of values of the form (a, b), where a <= b, and is inclusive.
 * Ex (1,1) is the range from 1 to 1 (ie: just 1)
 */
class Range
{
private:
    friend bool operator< (const Range& lhs, const Range& rhs);
    friend std::ostream& operator<<(std::ostream& os, const Range& obj);

    int64_t m_start;
    int64_t m_end;

public:
    Range(int64_t start, int64_t end) : m_start(start), m_end(end) {}
    int64_t getStart() const { return m_start; }
    int64_t getEnd() const { return m_end; }
    int64_t size() const { return m_end - m_start + 1; }
    bool canMerge(const Range& other) const {
        return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1));
    }
    int64_t merge(const Range& other) {
        int64_t change = 0;
        if (m_start > other.m_start) {
            change += m_start - other.m_start;
            m_start = other.m_start;
        }
        if (other.m_end > m_end) {
            change += other.m_end - m_end;
            m_end = other.m_end;
        }
        return change;
    }
};

inline bool operator< (const Range& lhs, const Range& rhs){return lhs.m_start < rhs.m_start;}
std::ostream& operator<<(std::ostream& os, const Range& obj) {
    os << '(' << obj.m_start << ',' << obj.m_end << ')';
    return os;
}

/*
 * Stuct to allow returning of multiple values
 */
struct NodeInfo {
    int64_t subTreeSize;
    int64_t depth;
    std::list<int64_t> ancestors;
    std::string representation;
};

/*
 * Collection of functions representing a complete binary tree
 * as an array created using pre-order depth-first search,
 * with 0 as the root.
 * Depth of the root is defined as 0.
 */
class Tree
{
private:
    int64_t m_depth;
public:
    Tree(int64_t depth) : m_depth(depth) {}
    int64_t size() const {
        return (int64_t(1) << (m_depth+1))-1;
    }
    int64_t getDepthOf(int64_t node) const{
        if (node == 0) { return 0; }
        int64_t searchDepth = m_depth;
        int64_t currentDepth = 1;
        while (true) {
            int64_t rightChild = int64_t(1) << searchDepth;
            if (node == 1 || node == rightChild) {
                break;
            } else if (node > rightChild) {
                node -= rightChild;
            } else {
                node -= 1;
            }
            currentDepth += 1;
            searchDepth -= 1;
        }
        return currentDepth;
    }
    int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const {
        if (node == 0) {
            return size();
        }
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1;
    }
    int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1;
    }
    int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2);
    }
    NodeInfo getNodeInfo(int64_t node) const {
        NodeInfo info;
        int64_t depth = 0;
        int64_t currentNode = 0;
        while (currentNode != node) {
            if (currentNode != 0) {
                info.ancestors.push_back(currentNode);
            }
            int64_t rightChild = getRightChildOf(currentNode, depth);
            if (rightChild == -1) {
                break;
            } else if (node >= rightChild) {
                info.representation += '1';
                currentNode = rightChild;
            } else {
                info.representation += '0';
                currentNode = getLeftChildOf(currentNode, depth);
            }
            depth++;
        }
        info.depth = depth;
        info.subTreeSize = getSubtreeSizeOf(node, depth);
        return info;
    }
};

// random selection amongst remaining allowed nodes
int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize, std::mt19937_64& randomGenerator)
{
    std::uniform_int_distribution<> randomDistribution(1, poolSize);
    int64_t selection = randomDistribution(randomGenerator);
    for (auto const& range : eliminationList) {
        if (selection >= range.getStart()) { selection += range.size(); }
        else { break; }
    }
    return selection;
}

// determine how many nodes have been eliminated
int64_t countEliminated(const std::deque<Range>& eliminationList)
{
    int64_t count = 0;
    for (auto const& range : eliminationList) {
        count += range.size();
    }
    return count;
}

// merge all the elimination ranges into listA, and return the number of new eliminations
int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB) {
    if(listB.empty()) { return 0; }
    if(listA.empty()) {
        listA.swap(listB);
        return countEliminated(listA);
    }

    int64_t newEliminations = 0;
    int64_t x = 0;
    auto listA_iter = listA.begin();
    auto listB_iter = listB.begin();
    while (listB_iter != listB.end()) {
        if (listA_iter == listA.end()) {
            listA_iter = listA.insert(listA_iter, *listB_iter);
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (listA_iter->canMerge(*listB_iter)) {
            x = listA_iter->merge(*listB_iter);
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (*listB_iter < *listA_iter) {
            listA_iter = listA.insert(listA_iter, *listB_iter) + 1;
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    while (listA_iter != listA.end()) {
        if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    return newEliminations;
}

int main (int argc, char** argv)
{
    std::random_device rd;
    std::mt19937_64 randomGenerator(rd());

    int64_t max_len = std::stoll(argv[1]);
    int64_t num_samples = std::stoll(argv[2]);

    int64_t samplesRemaining = num_samples;
    Tree tree(max_len);
    int64_t poolSize = tree.size() - 1;
    std::deque<Range> eliminationList;
    std::deque<Range> eliminated;
    std::list<std::string> foundList;

    while (samplesRemaining > 0 && poolSize > 0) {
        // find a valid node
        int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator);
        NodeInfo info = tree.getNodeInfo(selectedNode);
        foundList.push_back(info.representation);
        samplesRemaining--;

        // determine which nodes this choice eliminates
        eliminated.clear();
        for( auto const& ancestor : info.ancestors) {
            Range r(ancestor, ancestor);
            if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                eliminated.push_back(r);
            } else {
                eliminated.back().merge(r);
            }
        }
        Range r(selectedNode, selectedNode + info.subTreeSize - 1);
        if(eliminated.empty() || !eliminated.back().canMerge(r)) {
            eliminated.push_back(r);
        } else {
            eliminated.back().merge(r);
        }

        // add the eliminated nodes to the existing list
        poolSize -= mergeEliminations(eliminationList, eliminated);
    }

    // Print some stats
    // std::cout << "tree: " << tree.size() << " samplesRemaining: "
    //                       << samplesRemaining << " poolSize: "
    //                       << poolSize << " samples: " << foundList.size()
    //                       << " eliminated: "
    //                       << countEliminated(eliminationList) << std::endl;

    // Print list of binary strings
    // std::cout << "list:";
    // for (auto const& s : foundList) {
    //  std::cout << " " << s;
    // }
    // std::cout << std::endl;
}

Additional Thoughts

This algorithm scales extremely well with max_len. Scaling with n is not as good, but based on my own profiling it seems to do better than the other solutions.

This algorithm could easily be modified to allow for strings containing more than just '0' and '1'. More possible letters increases the fan-out of the tree, and would eliminate a wider range with each selection - still with all nodes of each subtree remaining contiguous.