JSP*_*JSP 15 random r sample distance
我想随机化/洗牌一个向量。一些向量元素是相同的。混洗后,相同元素的最小距离应为 3(即两个其他元素应位于相同元素之间)。
考虑以下 R 中的示例向量:
x <- rep(LETTERS[1:5], 3) # Create example vector
x
# [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
Run Code Online (Sandbox Code Playgroud)
如果我使用示例函数对向量进行混洗,则某些相同的元素可能会靠得太近。例如,如果我使用以下 R 代码,元素“C”会直接出现在位置 5 和 6 处:
set.seed(53135)
sample(x) # sample() function puts same elements too close
# [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
Run Code Online (Sandbox Code Playgroud)
如何确保相同元素的最小距离为三个?
所以基本上我们需要有条件地从 x 向量中采样一个没有在min.dist-1运行中选择的元素。使用 purrr 的 reduce 我们可以做到这一点:
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
Run Code Online (Sandbox Code Playgroud)
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Run Code Online (Sandbox Code Playgroud)
捆绑在一个函数中
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
Run Code Online (Sandbox Code Playgroud)
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
Run Code Online (Sandbox Code Playgroud)
之后@ 27 9?评论:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E
3 3 3 3 3
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Run Code Online (Sandbox Code Playgroud)
经过一些试验和错误,看看事实并非总是你会有足够的元素来隔开min.dist我想出了一个解决方案,这段代码是从上面的代码中得到最多解释的:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), function(.x, ...){
# whether the value is in the tail of the aggregated vector
in.tail <- x %in% tail(.x, min.dist)
# whether a value still hasn't reached the max frequency
freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
# whether a value isn't in the aggregated vector
yet <- !x %in% .x
# the if is there basically to account for the cases when we don't have enough vars to space out the vectors
c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else x[which(freq.got)[1]] )
}, .init=sample(x,1))
}
Run Code Online (Sandbox Code Playgroud)
现在运行table(shuffle(rep(LETTERS[1:5], 3),2))将始终为所有变量返回 3,我们可以肯定地说,在向量中变量的最小距离为 2。保证没有元素重复的唯一方法是使用,min.dist=length(unique(x))-1否则会有实例其中最大r < min.dist元素min.dist与其上次出现的元素没有距离,如果存在此类元素,它们将length(x) + 1 - 1:min.dist位于结果向量的子集中。
为了完全确定使用循环来检查输出向量的尾部是否具有唯一值:(删除我仅用于演示目的的打印语句)
shuffler <- function(x, min.dist=2){
while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E
3 3 3 3 3
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E
3 3 3 3 3
Run Code Online (Sandbox Code Playgroud)
shuffler <- function(x, min.dist=2){
while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
l
}
Run Code Online (Sandbox Code Playgroud)
这个新版本对向量尾部的元素是否进行了严格的测试min.dist,以前的版本适用于min.dist=2,但是这个新版本的测试更好。
如果您的数据很大,那么依靠概率来完成这类任务可能会(方式)更快。
下面是一个例子:
prob_shuffler = function(x, min.dist = 2){
n = length(x)
res = sample(x)
OK = FALSE
# We loop until we have a solution
while(!OK){
OK = TRUE
for(i in 1:min.dist){
# We check if identical elements are 'i' steps away
pblm = res[1:(n-i)] == res[-(1:i)]
if(any(pblm)){
if(sum(pblm) >= (n - i)/2){
# back to square 1
res = sample(x)
} else {
# we pair each identical element with
# an extra one
extra = sample(which(!pblm), sum(pblm))
id_reshuffle = c(which(pblm), extra)
res[id_reshuffle] = sample(res[id_reshuffle])
}
# We recheck from the beginning
OK = FALSE
break
}
}
}
res
}
Run Code Online (Sandbox Code Playgroud)
尽管while循环看起来很可怕,但实际上收敛速度很快。当然,min.dist远离两个字符的概率越低,收敛越快。
@Abdessabour Mtk 和 @Carles Sans Fuentes 当前的解决方案有效,但根据输入数据的大小,很快就会变得非常慢。这是一个基准:
library(microbenchmark)
x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520
microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 1) 87.001 111.501 155.071 131.801 192.401 264.401 10
#> shuffler_am(x, 1) 17218.100 18041.900 20324.301 18740.351 22296.301 26495.200 10
#> shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701 10
microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 2) 140.1 195.201 236.3312 245.252 263.202 354.101 10
#> shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800 29133.400 10
#> shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401 10
microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 3) 318.001 450.402 631.5312 573.352 782.2 1070.401 10
#> shuffler_am(x, 3) 19003.501 19622.300 23314.4808 20784.551 28281.5 32885.101 10
#> shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901 10
Run Code Online (Sandbox Code Playgroud)
我们可以评论两件事:a)在所有逻辑中,速度prob_shuffler取决于 ,min.dist而其他方法不那么重要,b)prob_shuffler仅 520 个观察(并且它可扩展)大约快 100 倍。
当然,如果同时出现两个相同字符的概率min.dist极高,那么递归方法应该更快。但在大多数实际情况下,概率方法更快。
| 归档时间: |
|
| 查看次数: |
346 次 |
| 最近记录: |