在 R 中高效枚举具有差异约束的所有子集

Noa*_*oah 5 algorithm performance combinations enumeration r

我有一个V长度为 的连续整数向量l,例如1, 2, 3, 4, 5, 6, 7。我想找到大小的所有子集,k使得子集中任意两个数字之间的差值不能小于m,例如2。使用上面的示例l = 7k = 3m = 2,子集是

1, 3, 5
1, 3, 6
1, 3, 7
1, 4, 6
1, 4, 7
1, 5, 7
2, 4, 6
2, 4, 7
2, 5, 7
3, 5, 7
Run Code Online (Sandbox Code Playgroud)

一种方法是枚举所有可能的大小子集k并丢弃任何不满足m约束的子集,但即使解决方案的数量很小,此过程也会爆炸。

我当前的方法是一种暴力算法,其中我从具有最小可能整数的子集开始,将最后一个条目增加 1,直到达到上限,增加前一个条目并将最后一个条目重置为可以的最低值鉴于先前条目的增加。也就是说,我从 开始1, 3, 5,然后将最后一位数字加一以获得1, 3, 61, 3, 7,然后由于达到上限,我将中间加 1(至4)并将最后一位数字设置为可以给定该数字的最小值(到6) 得到1, 4, 6,并照此进行。对于大型的 R 来说,这最终会变得相当慢l,我想知道是否有一个聪明的矢量化解决方案可以立即生成所有组合,这可以通过利用条目的顺序性质来实现。实现这个算法可以Rcpp稍微加快速度,但我仍然希望有一个更优雅的解决方案(如果有)。

Tho*_*ing 4

以下是几个基本 R 选项

  • 递归

我们可以像下面这样定义递归函数

f0 <- function(v, k, m) {
    if (k == 1) {
        return(matrix(v))
    }
    d <- Recall(v, k - 1, m)
    u <- unique(d[, ncol(d)])
    uu <- (u + m)[(u + m) %in% v]
    lst <- list()
    for (i in u) {
        dd <- d[d[, ncol(d)] == i, , drop = FALSE]
        p <- uu[uu - i >= m]
        if (length(p) > 0) {
            lst <- append(
                lst,
                list(cbind(dd[rep(1:nrow(dd), each = length(p)), , drop = FALSE], p))
            )
        }
    }
    unname(do.call(rbind, lst))
}
Run Code Online (Sandbox Code Playgroud)

我们可以得到

> f0(v = 1:7, k = 3, m = 2)
      [,1] [,2] [,3]
 [1,]    1    3    5
 [2,]    1    3    6
 [3,]    1    3    7
 [4,]    1    4    6
 [5,]    1    4    7
 [6,]    2    4    6
 [7,]    2    4    7
 [8,]    1    5    7
 [9,]    2    5    7
[10,]    3    5    7

> f0(v = 1:10, k = 3, m = 2)
      [,1] [,2] [,3]
 [1,]    1    3    5
 [2,]    1    3    6
 [3,]    1    3    7
 [4,]    1    3    8
 [5,]    1    3    9
 [6,]    1    3   10
 [7,]    1    4    6
 [8,]    1    4    7
 [9,]    1    4    8
[10,]    1    4    9
[11,]    1    4   10
[12,]    2    4    6
[13,]    2    4    7
[14,]    2    4    8
[15,]    2    4    9
[16,]    2    4   10
[17,]    1    5    7
[18,]    1    5    8
[19,]    1    5    9
[20,]    1    5   10
[21,]    2    5    7
[22,]    2    5    8
[23,]    2    5    9
[24,]    2    5   10
[25,]    3    5    7
[26,]    3    5    8
[27,]    3    5    9
[28,]    3    5   10
[29,]    1    6    8
[30,]    1    6    9
[31,]    1    6   10
[32,]    2    6    8
[33,]    2    6    9
[34,]    2    6   10
[35,]    3    6    8
[36,]    3    6    9
[37,]    3    6   10
[38,]    4    6    8
[39,]    4    6    9
[40,]    4    6   10
[41,]    1    7    9
[42,]    1    7   10
[43,]    2    7    9
[44,]    2    7   10
[45,]    3    7    9
[46,]    3    7   10
[47,]    4    7    9
[48,]    4    7   10
[49,]    5    7    9
[50,]    5    7   10
[51,]    1    8   10
[52,]    2    8   10
[53,]    3    8   10
[54,]    4    8   10
[55,]    5    8   10
[56,]    6    8   10
Run Code Online (Sandbox Code Playgroud)
  • for-循环方法

您可以简单地使用for如下循环运行

f1 <- function(v, k, m) {
    res <- matrix(v)
    p <- v
    for (i in 1:(k - 1)) {
        q <- (p + m)[(p + m) %in% v]
        lst <- list()
        for (j in 1:nrow(res)) {
            s <- q[q - res[j, i] >= m]
            if (length(s) > 0) {
                lst <- append(
                    lst,
                    list(cbind(res[rep(j, each = length(s)), , drop = FALSE], s))
                )
            }
        }
        res <- unname(do.call(rbind, lst))
        p <- q
    }
    res
}
Run Code Online (Sandbox Code Playgroud)

并将获得

> f1(v = 1:7, k = 3, m = 2)
      [,1] [,2] [,3]
 [1,]    1    3    5
 [2,]    1    3    6
 [3,]    1    3    7
 [4,]    1    4    6
 [5,]    1    4    7
 [6,]    2    4    6
 [7,]    2    4    7
 [8,]    1    5    7
 [9,]    2    5    7
[10,]    3    5    7

> f1(v = 1:10, k = 3, m = 2)
      [,1] [,2] [,3]
 [1,]    1    3    5
 [2,]    1    3    6
 [3,]    1    3    7
 [4,]    1    3    8
 [5,]    1    3    9
 [6,]    1    3   10
 [7,]    1    4    6
 [8,]    1    4    7
 [9,]    1    4    8
[10,]    1    4    9
[11,]    1    4   10
[12,]    2    4    6
[13,]    2    4    7
[14,]    2    4    8
[15,]    2    4    9
[16,]    2    4   10
[17,]    1    5    7
[18,]    1    5    8
[19,]    1    5    9
[20,]    1    5   10
[21,]    2    5    7
[22,]    2    5    8
[23,]    2    5    9
[24,]    2    5   10
[25,]    3    5    7
[26,]    3    5    8
[27,]    3    5    9
[28,]    3    5   10
[29,]    1    6    8
[30,]    1    6    9
[31,]    1    6   10
[32,]    2    6    8
[33,]    2    6    9
[34,]    2    6   10
[35,]    3    6    8
[36,]    3    6    9
[37,]    3    6   10
[38,]    4    6    8
[39,]    4    6    9
[40,]    4    6   10
[41,]    1    7    9
[42,]    1    7   10
[43,]    2    7    9
[44,]    2    7   10
[45,]    3    7    9
[46,]    3    7   10
[47,]    4    7    9
[48,]    4    7   10
[49,]    5    7    9
[50,]    5    7   10
[51,]    1    8   10
[52,]    2    8   10
[53,]    3    8   10
[54,]    4    8   10
[55,]    5    8   10
[56,]    6    8   10
Run Code Online (Sandbox Code Playgroud)
  • Reduce方法

另一个选项是使用,它通过添加符合条件的元素作为新列来Reduce迭代地增加结果的维度。data.framev

f2 <- function(v, k, m) {
    helper <- function(df, v) {
        u <- unique(df[[length(df)]])
        v <- (u + m)[(u + m) %in% v]
        grp <- split(df, df[length(df)])
        lst <- lapply(
            grp,
            \(x) {
                p <- v[v - x[[length(x)]][1] >= 2]
                if (length(p) > 0) {
                    cbind(x[rep(1:nrow(x), each = length(p)), ], p)
                }
            }
        )
        as.data.frame(`row.names<-`(unname(do.call(rbind, lst)), NULL))
    }
    Reduce(helper, rep(list(v), k - 1), init = as.data.frame(v))
}
Run Code Online (Sandbox Code Playgroud)

你将获得

> f2(v = 1:7, k = 3, m = 2)
        
1  1 3 5
2  1 3 6
3  1 3 7
4  1 4 6
5  1 4 7
6  2 4 6
7  2 4 7
8  1 5 7
9  2 5 7
10 3 5 7

> f2(v = 1:10, k = 3, m = 2)

1  1 3  5
2  1 3  6
3  1 3  7
4  1 3  8
5  1 3  9
6  1 3 10
7  1 4  6
8  1 4  7
9  1 4  8
10 1 4  9
11 1 4 10
12 2 4  6
13 2 4  7
14 2 4  8
15 2 4  9
16 2 4 10
17 1 5  7
18 1 5  8
19 1 5  9
20 1 5 10
21 2 5  7
22 2 5  8
23 2 5  9
24 2 5 10
25 3 5  7
26 3 5  8
27 3 5  9
28 3 5 10
29 1 6  8
30 1 6  9
31 1 6 10
32 2 6  8
33 2 6  9
34 2 6 10
35 3 6  8
36 3 6  9
37 3 6 10
38 4 6  8
39 4 6  9
40 4 6 10
41 1 7  9
42 1 7 10
43 2 7  9
44 2 7 10
45 3 7  9
46 3 7 10
47 4 7  9
48 4 7 10
49 5 7  9
50 5 7 10
51 1 8 10
52 2 8 10
53 3 8 10
54 4 8 10
55 5 8 10
56 6 8 10
Run Code Online (Sandbox Code Playgroud)

标杆管理

该基准包括该问题的所有现有答案

v <- 1:20
k <- 4
m <- 2
microbenchmark(
    f_AC = f(v, k, m),    # Allan Cameron's solution
    f_TIC0 = f0(v, k, m), # ThomasIsCoding's solution 0
    f_TIC1 = f1(v, k, m), # ThomasIsCoding's solution 1
    f_TIC2 = f2(v, k, m), # ThomasIsCoding's solution 2
    times = 20L,
    unit = "relative"
)
Run Code Online (Sandbox Code Playgroud)

我们看到

Unit: relative
   expr       min       lq     mean    median        uq      max neval
   f_AC 82.677137 69.78411 43.01843 83.497758 81.922518 6.791794    20
 f_TIC0  1.000000  1.00000  1.00000  1.000000  1.000000 1.000000    20
 f_TIC1  7.731772  7.74679  4.75455  8.012004  7.592652 1.316215    20
 f_TIC2 19.764325 16.68923 11.65485 17.963988 24.911318 2.359821    20
Run Code Online (Sandbox Code Playgroud)

v <- 1:100对于、k <- 5和的压力测试,递归(@Allan Cameron)和(@ThomasIsCodingm <- 2 )的性能如下所示ff0

> system.time(
+     res <- f0(v = 1:100, k = 5, m = 2)
+ )
   user  system elapsed 
   3.33    1.99    5.39

> system.time(
+     res <- f(v = 1:100, k = 5, m = 2)
+ )
   user  system elapsed 
 146.70    4.17  157.02
Run Code Online (Sandbox Code Playgroud)