R 中的 Shuffle 向量,但相同元素应该有最小距离

JSP*_*JSP 15 random r sample distance

我想随机化/洗牌一个向量。一些向量元素是相同的。混洗后,相同元素的最小距离应为 3(即两个其他元素应位于相同元素之间)。

考虑以下 R 中的示例向量:

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
Run Code Online (Sandbox Code Playgroud)

如果我使用示例函数对向量进行混洗,则某些相同的元素可能会靠得太近。例如,如果我使用以下 R 代码,元素“C”会直接出现在位置 5 和 6 处:

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
Run Code Online (Sandbox Code Playgroud)

如何确保相同元素的最小距离为三个?

Abd*_*Mtk 6

所以基本上我们需要有条件地从 x 向量中采样一个没有在min.dist-1运行中选择的元素。使用 purrr 的 reduce 我们可以做到这一点:

min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
Run Code Online (Sandbox Code Playgroud)
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Run Code Online (Sandbox Code Playgroud)

捆绑在一个函数中

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
Run Code Online (Sandbox Code Playgroud)
> shuffle(x, 3)
 [1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
 [1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
 [1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
 [1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
Run Code Online (Sandbox Code Playgroud)

之后@ 27 9?评论:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))

A B C D E 
3 3 3 3 3 
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument
Run Code Online (Sandbox Code Playgroud)

更新

经过一些试验和错误,看看事实并非总是你会有足够的元素来隔开min.dist我想出了一个解决方案,这段代码是从上面的代码中得到最多解释的:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), function(.x, ...){
        # whether the value is in the tail of the aggregated vector
        in.tail <- x %in% tail(.x, min.dist)
        # whether a value still hasn't reached the max frequency
        freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
        # whether a value isn't in the aggregated vector
        yet <- !x %in% .x
        # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
         c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
    }, .init=sample(x,1))
}
Run Code Online (Sandbox Code Playgroud)

现在运行table(shuffle(rep(LETTERS[1:5], 3),2))将始终为所有变量返回 3,我们可以肯定地说,在向量中变量的最小距离为 2。保证没有元素重复的唯一方法是使用,min.dist=length(unique(x))-1否则会有实例其中最大r < min.dist元素min.dist与其上次出现的元素没有距离,如果存在此类元素,它们将length(x) + 1 - 1:min.dist位于结果向量的子集中。

为了完全确定使用循环来检查输出向量的尾部是否具有唯一值:(删除我仅用于演示目的的打印语句)

shuffler <- function(x, min.dist=2){
    while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
    l
}

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
 [1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"

A B C D E 
3 3 3 3 3 

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
 [1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"

A B C D E 
3 3 3 3 3 
Run Code Online (Sandbox Code Playgroud)

更新:

shuffler <- function(x, min.dist=2){
    while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
    l
}
Run Code Online (Sandbox Code Playgroud)

这个新版本对向量尾部的元素是否进行了严格的测试min.dist,以前的版本适用于min.dist=2,但是这个新版本的测试更好。


Lau*_*rgé 6

如果您的数据很大,那么依靠概率来完成这类任务可能会(方式)更快。

下面是一个例子:

prob_shuffler = function(x, min.dist = 2){
    n = length(x)
    res = sample(x)
    OK = FALSE
    
    # We loop until we have a solution
    while(!OK){
        OK = TRUE
        for(i in 1:min.dist){
            # We check if identical elements are 'i' steps away
            pblm = res[1:(n-i)] == res[-(1:i)]
            if(any(pblm)){
                if(sum(pblm) >= (n - i)/2){
                    # back to square 1
                    res = sample(x)
                } else {
                    # we pair each identical element with 
                    # an extra one
                    extra = sample(which(!pblm), sum(pblm))
                    id_reshuffle = c(which(pblm), extra)
                    res[id_reshuffle] = sample(res[id_reshuffle])
                }

                # We recheck from the beginning
                OK = FALSE
                break
            }
        }
    }

    res
}

Run Code Online (Sandbox Code Playgroud)

尽管while循环看起来很可怕,但实际上收敛速度很快。当然,min.dist远离两个字符的概率越低,收敛越快。

@Abdessabour Mtk 和 @Carles Sans Fuentes 当前的解决方案有效,但根据输入数据的大小,很快就会变得非常慢。这是一个基准:

library(microbenchmark)

x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520

microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10

microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10

microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10
Run Code Online (Sandbox Code Playgroud)

我们可以评论两件事:a)在所有逻辑中,速度prob_shuffler取决于 ,min.dist而其他方法不那么重要,b)prob_shuffler仅 520 个观察(并且它可扩展)大约快 100 倍。

当然,如果同时出现两个相同字符的概率min.dist极高,那么递归方法应该更快。但在大多数实际情况下,概率方法更快。