R:具有/不具有替换的排列和组合以及用于不同/非独特项目/多集的排列和组合

Ran*_*Lai 21 combinations r permutation multiset r-faq

在这个主题中,我试图在这里包括所有常见问题及其答案.我希望这对某人有用.

一般问题:如何rn对象生成对象序列?

  • 组合与排列.
  • 有替换和没有替换.
  • 不同的项目与非独特的项目(多重集合).

总共存在这类2^3=8问题.

[更新]

Josh O'Brien认为这8个问题与十二种方式有关.实际上,"不同"的问题以十二种方式包含在内,而"非独特"的问题则不包括在内.无论如何,将这里的8个问题与12倍的方式进行比较是很有趣的.请参阅注释以获取进一步的读数.

Ran*_*Lai 23

编辑:我已经更新了答案,使用更高效的包arrangements

开始使用 arrangement

安排包含一些有效的发生器和迭代器用于排列和组合.已经证明,它arrangements优于大多数类似的现有包装.可以在这里找到一些基准.

以下是上述问题的答案

# 1) combinations: without replacement: distinct items

combinations(5, 2)

      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    3
 [6,]    2    4
 [7,]    2    5
 [8,]    3    4
 [9,]    3    5
[10,]    4    5


# 2) combinations: with replacement: distinct items

combinations(5, 2, replace=TRUE)

      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    2    2
 [7,]    2    3
 [8,]    2    4
 [9,]    2    5
[10,]    3    3
[11,]    3    4
[12,]    3    5
[13,]    4    4
[14,]    4    5
[15,]    5    5



# 3) combinations: without replacement: non distinct items

combinations(x = c("a", "b", "c"), freq = c(2, 1, 1), k = 2)

     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "c" 



# 4) combinations: with replacement: non distinct items

combinations(x = c("a", "b", "c"), k = 2, replace = TRUE)  # as `freq` does not matter

     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "b" 
[5,] "b"  "c" 
[6,] "c"  "c" 

# 5) permutations: without replacement: distinct items

permutations(5, 2)

      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    1
 [6,]    2    3
 [7,]    2    4
 [8,]    2    5
 [9,]    3    1
[10,]    3    2
[11,]    3    4
[12,]    3    5
[13,]    4    1
[14,]    4    2
[15,]    4    3
[16,]    4    5
[17,]    5    1
[18,]    5    2
[19,]    5    3
[20,]    5    4



# 6) permutations: with replacement: distinct items

permutations(5, 2, replace = TRUE)

      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    2    1
 [7,]    2    2
 [8,]    2    3
 [9,]    2    4
[10,]    2    5
[11,]    3    1
[12,]    3    2
[13,]    3    3
[14,]    3    4
[15,]    3    5
[16,]    4    1
[17,]    4    2
[18,]    4    3
[19,]    4    4
[20,]    4    5
[21,]    5    1
[22,]    5    2
[23,]    5    3
[24,]    5    4
[25,]    5    5


# 7) permutations: without replacement: non distinct items

permutations(x = c("a", "b", "c"), freq = c(2, 1, 1), k = 2)

     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "a" 
[5,] "b"  "c" 
[6,] "c"  "a" 
[7,] "c"  "b" 



# 8) permutations: with replacement: non distinct items

permutations(x = c("a", "b", "c"), k = 2, replace = TRUE)  # as `freq` doesn't matter

      [,1] [,2]
 [1,] "a"  "a" 
 [2,] "a"  "b" 
 [3,] "a"  "c" 
 [4,] "b"  "a" 
 [5,] "b"  "b" 
 [6,] "b"  "c" 
 [7,] "c"  "a" 
 [8,] "c"  "b" 
 [9,] "c"  "c" 
Run Code Online (Sandbox Code Playgroud)

与其他包比较

使用arrangements现有包的优点很少.

  1. 整体框架:您不必为不同的方法使用不同的包.

  2. 它非常有效.有关基准测试,请参阅https://randy3k.github.io/arrangements/articles/benchmark.html.

  3. 它具有内存效率,可以生成全部13个!由于矩阵大小的限制,现有的包将无法进行1到13的排列.getnext()迭代器的方法允许用户逐个进行安排.

  4. 生成的排列是字典顺序,对于一些用户可能是期望的.

  • 我认为这个答案可能会通过显示一些讲述每个"问题"故事的输出来改进. (3认同)
  • 我想这通常不是问题,因为包调用另一个包中的函数的推荐方法是“<package>::<function>”。除非涉及脚本,否则具有相似的名称应该不会造成太大伤害......IMO。 (2认同)

Jos*_*ood 19

在R*中走过一片组合

下面,我们将研究具有生成组合和排列功能的软件包.如果我遗漏了任何套餐,请原谅我,请发表评论或更好,编辑这篇文章.

分析大纲:

  1. 介绍
  2. 组合
  3. 排列
  4. 多集
  5. 摘要
  6. 记忆

在我们开始之前,我们注意到,组合/排列置换的不同选择与非distint项,在一个时间是等价的.就是这样,因为当我们有替换时,它并不具体.因此,无论特定元素最初出现多少次,输出将具有重复1到m次的该元素的实例.

1.简介

  1. gtools v 3.8.1
  2. combinat v 0.0-8
  3. multicool v 0.1-10
  4. partitions v 1.9-19
  5. RcppAlgos v 2.0.1(我是作者)
  6. arrangements v 1.1.0
  7. gRbase v 1.8-3

我没有包括permute,permutations或者gRbase::aperm/ar_perm因为它们并不是真正意图攻击这些类型的问题.

| --------------------------------------- 概述 --------- ------------------------------- |

|_______________| gtools | combinat | multicool | partitions | 
|      comb rep |  Yes   |          |           |            | 
|   comb NO rep |  Yes   |   Yes    |           |            | 
|      perm rep |  Yes   |          |           |            |  
|   perm NO rep |  Yes   |   Yes    |    Yes    |    Yes     |
| perm multiset |        |          |    Yes    |            |  
| comb multiset |        |          |           |            |  
|accepts factors|        |   Yes    |           |            |  
|   m at a time |  Yes   |  Yes/No  |           |            |  
|general vector |  Yes   |   Yes    |    Yes    |            |
|    iterable   |        |          |    Yes    |            |
|parallelizable |        |          |           |            |
|  big integer  |        |          |           |            |

|_______________| iterpc | arrangements | RcppAlgos | gRbase |
|      comb rep |  Yes   |     Yes      |    Yes    |        |
|   comb NO rep |  Yes   |     Yes      |    Yes    |  Yes   |   
|      perm rep |  Yes   |     Yes      |    Yes    |        |
|   perm NO rep |  Yes   |     Yes      |    Yes    |   *    |
| perm multiset |  Yes   |     Yes      |    Yes    |        |
| comb multiset |  Yes   |     Yes      |    Yes    |        |
|accepts factors|        |     Yes      |    Yes    |        |
|   m at a time |  Yes   |     Yes      |    Yes    |  Yes   |
|general vector |  Yes   |     Yes      |    Yes    |  Yes   |
|    iterable   |        |     Yes      | Partially |        |
|parallelizable |        |     Yes      |    Yes    |        |
|  big integer  |        |     Yes      |           |        |
Run Code Online (Sandbox Code Playgroud)

的任务,m at a time并且general vector,请参阅"产生结果的能力在时间"(当小于矢量的长度),并重新排列"一般矢量"相对于1:n.在实践中,我们通常关注的是找到一般向量的重新排列,因此下面的所有检查都将反映这一点(如果可能的话).

所有基准都是在3种不同的设置上运行的.

  1. Macbook Pro i7 16Gb
  2. Macbook Air i5 4Gb
  3. 联想运行Windows 7 i5 8Gb

列出的结果是从设置#1(即MBPro)获得的.其他两个系统的结果相似.此外,gc()会定期调用以确保所有内存都可用(请参阅参考资料?gc).

2.组合

首先,我们检查组合,而不是一次选择m.

  1. RcppAlgos
  2. combinat(或utils)
  3. gtools
  4. arrangements
  5. gRbase

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(13)
testVector1 <- sort(sample(100, 17))
m <- 9
t1 <- comboGeneral(testVector1, m)  ## returns matrix with m columns
t3 <- combinat::combn(testVector1, m)  ## returns matrix with m rows
t4 <- gtools::combinations(17, m, testVector1)  ## returns matrix with m columns
identical(t(t3), t4) ## must transpose to compare
#> [1] TRUE
t5 <- combinations(testVector1, m)
identical(t1, t5)
#> [1] TRUE
t6 <- gRbase::combnPrim(testVector1, m)
identical(t(t6)[do.call(order, as.data.frame(t(t6))),], t1)
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

基准测试:

microbenchmark(cbRcppAlgos = comboGeneral(testVector1, m),
               cbGRbase = gRbase::combnPrim(testVector1, m),
               cbGtools = gtools::combinations(17, m, testVector1),
               cbCombinat = combinat::combn(testVector1, m),
               cbArrangements = combinations(17, m, testVector1),
               unit = "relative")
#> Unit: relative
#>            expr     min      lq    mean  median      uq    max neval
#>     cbRcppAlgos   1.064   1.079   1.160   1.012   1.086  2.318   100
#>        cbGRbase   7.335   7.509   5.728   6.807   5.390  1.608   100
#>        cbGtools 426.536 408.807 240.101 310.848 187.034 63.663   100
#>      cbCombinat  97.756  97.586  60.406  75.415  46.391 41.089   100
#>  cbArrangements   1.000   1.000   1.000   1.000   1.000  1.000   100
Run Code Online (Sandbox Code Playgroud)

现在,我们一次检查选择m的替换组合.

  1. RcppAlgos
  2. gtools
  3. arrangements

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(97)
testVector2 <- sort(rnorm(10))
m <- 8
t1 <- comboGeneral(testVector2, m, repetition = TRUE)
t3 <- gtools::combinations(10, m, testVector2, repeats.allowed = TRUE)
identical(t1, t3)
#> [1] TRUE
## arrangements
t4 <- combinations(testVector2, m, replace = TRUE)
identical(t1, t4)
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

基准测试:

microbenchmark(cbRcppAlgos = comboGeneral(testVector2, m, TRUE),
               cbGtools = gtools::combinations(10, m, testVector2, repeats.allowed = TRUE),
               cbArrangements = combinations(testVector2, m, replace = TRUE),
               unit = "relative")
#> Unit: relative
#>            expr     min      lq   mean  median      uq     max neval
#>     cbRcppAlgos   1.000   1.000  1.000   1.000   1.000 1.00000   100
#>        cbGtools 384.990 269.683 80.027 112.170 102.432 3.67517   100
#>  cbArrangements   1.057   1.116  0.618   1.052   1.002 0.03638   100
Run Code Online (Sandbox Code Playgroud)

3.排列

首先,我们检查排列,而不是一次选择m.

  1. RcppAlgos
  2. gtools
  3. arrangements

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(101)
testVector3 <- as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29))

## RcppAlgos... permuteGeneral same as comboGeneral above
t1 <- permuteGeneral(testVector3, 6)
## gtools... permutations same as combinations above
t3 <- gtools::permutations(10, 6, testVector3)
identical(t1, t3)
#> [1] TRUE
## arrangements
t4 <- permutations(testVector3, 6)
identical(t1, t4)
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

基准测试:

microbenchmark(cbRcppAlgos = permuteGeneral(testVector3, 6),
               cbGtools = gtools::permutations(10, 6, testVector3),
               cbArrangements = permutations(testVector3, 6),
               unit = "relative")
#> Unit: relative
#>            expr     min     lq   mean median     uq   max neval
#>     cbRcppAlgos   1.079  1.027  1.106  1.037  1.003  5.37   100
#>        cbGtools 158.720 92.261 85.160 91.856 80.872 45.39   100
#>  cbArrangements   1.000  1.000  1.000  1.000  1.000  1.00   100
Run Code Online (Sandbox Code Playgroud)

接下来,我们检查排列而不用一般向量替换(返回所有排列).

  1. RcppAlgos
  2. gtools
  3. combinat
  4. multicool
  5. arrangements

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(89)
testVector3 <- as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29))
testVector3Prime <- testVector3[1:7]
## For RcppAlgos, & gtools (see above)

## combinat
t4 <- combinat::permn(testVector3Prime) ## returns a list of vectors
## convert to a matrix
t4 <- do.call(rbind, t4)
## multicool.. we must first call initMC
t5 <- multicool::allPerm(multicool::initMC(testVector3Prime)) ## returns a matrix with n columns
all.equal(t4[do.call(order,as.data.frame(t4)),],
          t5[do.call(order,as.data.frame(t5)),])
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

基准测试:

microbenchmark(cbRcppAlgos = permuteGeneral(testVector3Prime, 7),
               cbGtools = gtools::permutations(7, 7, testVector3Prime),
               cbCombinat = combinat::permn(testVector3Prime),
               cbMulticool = multicool::allPerm(multicool::initMC(testVector3Prime)),
               cbArrangements = permutations(x = testVector3Prime, k = 7),
               unit = "relative")
#> Unit: relative
#>            expr      min       lq     mean   median       uq     max neval
#>     cbRcppAlgos    1.152    1.275   0.7508    1.348    1.342  0.3159   100
#>        cbGtools  965.465  817.645 340.4159  818.137  661.068 12.7042   100
#>      cbCombinat  280.207  236.853 104.4777  238.228  208.467  9.6550   100
#>     cbMulticool 2573.001 2109.246 851.3575 2039.531 1638.500 28.3597   100
#>  cbArrangements    1.000    1.000   1.0000    1.000    1.000  1.0000   100
Run Code Online (Sandbox Code Playgroud)

现在,我们检查排列而不替换1:n(返回所有排列).

  1. RcppAlgos
  2. gtools
  3. combinat
  4. multicool
  5. partitions
  6. arrangements

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(89)
t1 <- partitions::perms(7)  ## returns an object of type 'partition' with n rows
identical(t(as.matrix(t1)), permutations(7,7))
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

基准测试:

microbenchmark(cbRcppAlgos = permuteGeneral(7, 7),
               cbGtools = gtools::permutations(7, 7),
               cbCombinat = combinat::permn(7),
               cbMulticool = multicool::allPerm(multicool::initMC(1:7)),
               cbPartitions = partitions::perms(7),
               cbArrangements = permutations(7, 7),
               unit = "relative")
#> Unit: relative
#>            expr      min       lq     mean   median       uq      max
#>     cbRcppAlgos    1.235    1.429    1.412    1.503    1.484    1.720
#>        cbGtools 1152.826 1000.736  812.620  939.565  793.373  499.029
#>      cbCombinat  347.446  304.866  260.294  296.521  248.343  284.001
#>     cbMulticool 3001.517 2416.716 1903.903 2237.362 1811.006 1311.219
#>    cbPartitions    2.469    2.536    2.801    2.692    2.999    2.472
#>  cbArrangements    1.000    1.000    1.000    1.000    1.000    1.000
#>  neval
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100
Run Code Online (Sandbox Code Playgroud)

最后,我们检查替换的排列.

  1. RcppAlgos
  2. iterpc
  3. gtools
  4. arrangements

如何:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(34)
testVector3 <- as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29))
t1 <- permuteGeneral(testVector3, 5, repetition = TRUE)
t3 <- gtools::permutations(10, 5, testVector3, repeats.allowed = TRUE)
t4 <- permutations(x = testVector3, k = 5, replace = TRUE)
Run Code Online (Sandbox Code Playgroud)

This next benchmark is a little surprising given the results up until now.

microbenchmark(cbRcppAlgos = permuteGeneral(testVector3, 5, TRUE),
               cbGtools = gtools::permutations(10, 5, testVector3, repeats.allowed = TRUE),
               cbArrangements = permutations(x = testVector3, k = 5, replace = TRUE),
               unit = "relative")
#> Unit: relative
#>            expr   min     lq  mean median    uq   max neval
#>     cbRcppAlgos 1.106 0.9183 1.200  1.030 1.063 1.701   100
#>        cbGtools 2.426 2.1815 2.068  1.996 2.127 1.367   100
#>  cbArrangements 1.000 1.0000 1.000  1.000 1.000 1.000   100
Run Code Online (Sandbox Code Playgroud)

That is not a typo... gtools::permutations is almost as fast as the other compiled functions. I encourage the reader to go check out the source code of gtools::permutations as it is one of the most elegant displays of programming out there (R or otherwise).

4. Multisets

First, we examine combinations of multisets.

  1. RcppAlgos
  2. arrangements

To find combinations/permutations of multisets, with RcppAlgos use the freqs arguments to specify how many times each element of the source vector, v, is repeated.

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(496)
myFreqs <- sample(1:5, 10, replace = TRUE)
## This is how many times each element will be repeated
myFreqs
#>  [1] 2 4 4 5 3 2 2 2 3 4
testVector4 <- as.integer(c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
t1 <- comboGeneral(testVector4, 12, freqs = myFreqs)
t3 <- combinations(freq = myFreqs, k = 12, x = testVector4)
identical(t1, t3)
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

Benchmark:

microbenchmark(cbRcppAlgos = comboGeneral(testVector4, 12, freqs = myFreqs),
               cbArrangements = combinations(freq = myFreqs, k = 12, x = testVector4),
               unit = "relative")
#> Unit: relative
#>            expr   min    lq  mean median    uq   max neval
#>     cbRcppAlgos 1.000 1.000 1.000  1.000 1.000 1.000   100
#>  cbArrangements 1.254 1.221 1.287  1.259 1.413 1.173   100
Run Code Online (Sandbox Code Playgroud)

For permutations of multisets chosen m at a time, we have:

  1. RcppAlgos
  2. arrangements

How to:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(8128)
myFreqs <- sample(1:3, 5, replace = TRUE)
testVector5 <- sort(runif(5))
myFreqs
#> [1] 2 2 2 1 3
t1 <- permuteGeneral(testVector5, 7, freqs = myFreqs)
t3 <- permutations(freq = myFreqs, k = 7, x = testVector5)
identical(t1, t3)
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

Benchmark:

microbenchmark(cbRcppAlgos = permuteGeneral(testVector5, 7, freqs = myFreqs),
               cbArrangements = permutations(freq = myFreqs, k = 7, x = testVector5),
               unit = "relative")
#> Unit: relative
#>            expr   min    lq  mean median    uq   max neval
#>     cbRcppAlgos 1.461 1.327 1.282  1.177 1.176 1.101   100
#>  cbArrangements 1.000 1.000 1.000  1.000 1.000 1.000   100
Run Code Online (Sandbox Code Playgroud)

For permutations of multisets returning all permutations, we have:

  1. RcppAlgos
  2. multicool
  3. arrangements

How to:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(8128)
myFreqs2 <- c(2,1,2,1,2)
testVector6 <- (1:5)^3
## For multicool, you must have the elements explicitly repeated
testVector6Prime <- rep(testVector6, times = myFreqs2)
t3 <- multicool::allPerm(multicool::initMC(testVector6Prime))

## for comparison
t1 <- permuteGeneral(testVector6, freqs = myFreqs2)
identical(t1[do.call(order,as.data.frame(t1)),],
          t3[do.call(order,as.data.frame(t3)),])
#> [1] TRUE
Run Code Online (Sandbox Code Playgroud)

Benchmark:

microbenchmark(cbRcppAlgos = permuteGeneral(testVector6, freqs = myFreqs2),
               cbMulticool = multicool::allPerm(multicool::initMC(testVector6Prime)),
               cbArrangements = permutations(freq = myFreqs2, x = testVector6),
               unit = "relative")
#> Unit: relative
#>            expr      min       lq    mean   median      uq     max neval
#>     cbRcppAlgos    1.276    1.374   1.119    1.461    1.39  0.8856   100
#>     cbMulticool 2434.652 2135.862 855.946 2026.256 1521.74 31.0651   100
#>  cbArrangements    1.000    1.000   1.000    1.000    1.00  1.0000   100
Run Code Online (Sandbox Code Playgroud)

5. Summary

Both gtools and combinat are well established packages for rearranging elements of a vector. With gtools there are a few more options (see the overview above) and with combinat, you can rearrange factors. With multicool, one is able to rearrange multisets. Although partitions and gRbase are limited for the purposes of this question, they are powerhouses packed with highly efficient functions for dealing with partitions and array objects respectively.

arrangements

  1. The output is in dictionary order.
  2. Allows the user to specify the format via the layout argument (r = row-major, c = column-major, and l = list).
  3. Offers convenient methods such as collect & getnext when working with iterators.
  4. Allows for the generation of more than 2^31 - 1 combinations/permutations via getnext. N.B. RcppAlgos (via lower/upper see below) and multicool (via nextPerm) are also capable of doing this.
  5. Speaking of getnext, this function, allows for a specific number of results by utilizing the d argument.
  6. Supports gmp's big integers to compute number of combinations/permutations.

Observe:

library(arrangements)
icomb <- icombinations(1000, 7)
icomb$getnext(d = 5)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,]    1    2    3    4    5    6    7
#> [2,]    1    2    3    4    5    6    8
#> [3,]    1    2    3    4    5    6    9
#> [4,]    1    2    3    4    5    6   10
#> [5,]    1    2    3    4    5    6   11
Run Code Online (Sandbox Code Playgroud)

This feature is really nice when you only want a few combinations/permutations. With traditional methods, you would have to generate all combinations/permutations and then subset. This would render the previous example impossible as there are more than 10^17 results (i.e. ncombinations(1000, 7, bigz = TRUE) = 194280608456793000).

This feature along with the improvements to the generators in arrangements, allow it to be very efficient with respect to memory.

RcppAlgos

  1. The output is in dictionary order.
  2. There are convenient constraint features that we will not discuss here as they are off-topic for this question. I will only note that the types of problems that can be solved by utilizing these features were the motivation for creating this package.
  3. There is an argument upper (formally rowCap) that is analogous to the d argument of getnext.

Observe:

library(RcppAlgos)
comboGeneral(1000, 7, upper = 5)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,]    1    2    3    4    5    6    7
#> [2,]    1    2    3    4    5    6    8
#> [3,]    1    2    3    4    5    6    9
#> [4,]    1    2    3    4    5    6   10
#> [5,]    1    2    3    4    5    6   11
Run Code Online (Sandbox Code Playgroud)
  1. Additionally, as of 2.0.0, there is an argument called lower that allows one to start generation at a specific combination/permutation. This sets up nicely for parallelization and allows for fast generation beyond 2^31 - 1 as chunks are generated independently.

Parallel example with more than 6 billion combinations:

system.time(parallel::mclapply(seq(1,6397478649,4390857), function(x) {
        a <- comboGeneral(25, 15, freqs = c(rep(1:5, 5)), lower = x, upper = x + 4390856)
        ## do something
        x
    }, mc.cores = 7))
#>     user  system elapsed 
#>  510.623 140.970 109.496
Run Code Online (Sandbox Code Playgroud)

In case you were wondering how each package scales, I will leave you with this final example that measures how fast each package can generate over 100 million results (N.B. gtools::combinations is left out here as it will throw the error: evaluation nested too deeply...). Also, we are explicitly calling combn from the utils package because I was unable to get a successful run from combinat::combn. The differences in memory usage between these two is quite bizarre given that they are only marginally different (see ?utils::combn under the "Authors" section).

Observe:

library(RcppAlgos)
library(arrangements)
library(microbenchmark)
options(digits = 4)
set.seed(2187)
testVector7 <- sort(sample(10^7, 10^3))
system.time(utils::combn(testVector7, 3))
#>    user  system elapsed 
#> 179.956   5.687 187.159
system.time(RcppAlgos::comboGeneral(testVector7, 3))
#>    user  system elapsed 
#>   1.136   0.758   1.937
system.time(arrangements::combinations(x = testVector7, k = 3))
#>    user  system elapsed 
#>   1.963   0.930   2.910
system.time(RcppAlgos::permuteGeneral(testVector7[1:500], 3))
#>    user  system elapsed 
#>   1.095   0.631   1.738
system.time(arrangements::permutations(x = testVector7[1:500], k = 3))
#>    user  system elapsed 
#>   1.399   0.584   1.993
Run Code Online (Sandbox Code Playgroud)

6. Memory

When executing comboGeneral as well as arrangements::combinations, the memory will jump almost 2 Gbs before calling gc. This seems about right as #rows * #nols * bytesPerCell / 2^30 bytes = choose(1000,3) * 3 * 4 / 2^30 bytes = (166167000 * 3 * 4)/2^30 = 1.857 Gbs). However, when executing combn, the memory behavior was eratic (e.g. sometimes it would use all 16 Gb of memory and other times it would only spike a couple of Gbs). When I tested this on the Windows set-up, it would often crash.

We can confirm this using Rprof along with summaryRporf. Observe:

Rprof("RcppAlgos.out", memory.profiling = TRUE)
t1 <- RcppAlgos::comboGeneral(testVector7, 3)
Rprof(NULL)
summaryRprof("RcppAlgos.out", memory = "both")$by.total
                          total.time total.pct mem.total self.time self.pct
"CombinatoricsRcpp"              1.2       100    1901.6       1.2      100
"RcppAlgos::comboGeneral"        1.2       100    1901.6       0.0        0

Rprof("arrangements.out", memory.profiling = TRUE)
t3 <- arrangements::combinations(10^3, 3, testVector7)
Rprof(NULL)
summaryRprof("arrangements.out", memory = "both")$by.total
                             total.time total.pct mem.total self.time self.pct
".Call"                            2.08     99.05    1901.6      2.08    99.05
Run Code Online (Sandbox Code Playgroud)

With RcppAlgos & arrangements, mem.total registers just over 1900 Mb.

And here is the memory profile on a smaller vector comparing gtools, utils, and combinat.

testVector7Prime <- testVector7[1:300]

Rprof("combinat.out", memory.profiling = TRUE)
t3 <- combinat::combn(testVector7Prime, 3)
Rprof(NULL)
summaryRprof("combinat.out", memory = "both")$by.total
                  total.time total.pct mem.total self.time self.pct
"combinat::combn"       3.98    100.00    1226.9      3.72    93.47

Rprof("utils.out", memory.profiling = TRUE)
t4 <- utils::combn(testVector7Prime, 3)
Rprof(NULL)
summaryRprof("utils.out", memory = "both")$by.total
               total.time total.pct mem.total self.time self.pct
"utils::combn"       2.52    100.00    1952.7      2.50    99.21

Rprof("gtools.out", memory.profiling = TRUE)
t5 <- gtools::combinations(300, 3, testVector7Prime)
Rprof(NULL)
summaryRprof("gtools.out", memory = "both")$by.total
                      total.time total.pct mem.total self.time self.pct
"rbind"                     4.94     95.00    6741.6      4.40    84.62
Run Code Online (Sandbox Code Playgroud)

Interestingly, utils::combn and combinat::combn use different amounts of memory and take different amounts of time to execute. This does not hold up with smaller vectors:

microbenchmark(combinat::combn(2:13, 6), utils::combn(2:13, 6))
Unit: microseconds
                    expr     min      lq     mean  median       uq      max neval
combinat::combn(2:13, 6) 527.378 567.946 629.1268 577.163 604.3270 1816.744   100
   utils::combn(2:13, 6) 663.150 712.872 750.8008 725.716 771.1345 1205.697   100
Run Code Online (Sandbox Code Playgroud)

And with gtools the total memory used is a little over 3x as much as utils. It should be noted that for these 3 packages, I obtained different results every-time I ran them (e.g. for combinat::combn sometimes I would get 9000 Mb and then I would get 13000 Mb).

Still, none can match RcppAlgos OR arrangements. Both only use 51 Mb when ran on the example above.

benchmark script: https://gist.github.com/randy3k/bd5730a6d70101c7471f4ae6f453862e (rendered by https://github.com/tidyverse/reprex)

*: An homage to A Walk through Combinatorics by Miklós Bóna

  • 我惊呆了!对这个主题的探索多么彻底!谢谢! (2认同)