Creating combinations by group in R

Sha*_*Les 9 algorithm statistics combinations r

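I want to create a list for my classroom of every possible way to split the students into groups of 4. If I have 20 students, how can I create that object in R by group, where each row is one combination and there are 20 columns for the full list of student IDs, with columns 1-4 being "group1", columns 5-8 being "group2", and so on?

The following gives a list of the possible combinations for a single group of 4 students (x1, x2, x3, and x4). Now, for each row listed, what are the possibilities for the other 4 groups of 4 students? So there should be 20 columns (Group1_1:4, Group2_1:4, Group3_1:4, Group4_1:4, Group5_1:4).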

combn(c(1:20), m = 4)

Desired output

Combination 1 = Group1[1, 2, 3, 4] Group2[5, 6, 7, 8], Group3[9, 10, 11, 12], etc. 
Combination 2 = Group1[1, 2, 3, 5]... etc. 

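There are a lot of posts about combinations out there, so it's possible this has already been answered and I just couldn't find it. Any help is appreciated!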

Jon*_*ing 5

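Computationally, this is a challenging problem, since I believe there are about 2.5 billion possibilities to enumerate. (If that's mistaken, I'd welcome any insight into where this approach goes wrong.)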

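Depending on how it's stored, a table with all those groupings might require more RAM than most computers can handle. I'd be impressed to see an efficient way to create it. Even with a "create one combination at a time" approach, it would still take about 42 minutes to generate all the possibilities at 1,000,000 per second, or about a month at only 1,000 per second.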
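The rough figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes each student ID is stored as a 4-byte integer with no overhead, and the variable names are illustrative.

```r
n_groupings <- 2546168625   # total number of groupings derived in this answer
students    <- 20

# Assumption: one 4-byte integer per student ID, ignoring any object overhead.
bytes <- n_groupings * students * 4
gb    <- bytes / 1e9                          # roughly 204 GB

# Time to generate everything one grouping at a time.
minutes_at_1e6 <- n_groupings / 1e6 / 60      # roughly 42 minutes
days_at_1e3    <- n_groupings / 1e3 / 86400   # roughly 29.5 days

round(c(gb = gb, minutes = minutes_at_1e6, days = days_at_1e3), 1)
```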

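EDIT - added a partial implementation at the bottom to create any desired grouping from #1 to #2,546,168,625. For some purposes, this may be almost as good as actually storing the whole sequence, which is very large.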


Let's say we're going to make 5 groups of four students each: groups A, B, C, D, and E.

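Let's define group A as the group that student #1 is in. That student can be paired with any three of the other 19 students. I believe there are 969 such combinations of other students: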

> nrow(t(combn(1:19, 3)))
[1] 969

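There are now 16 students left for the other groups. Let's assign the first student not already in group A to group B. That could be student 2, 3, 4, or 5. It doesn't matter which one it is; all we need to know is that there are only 15 students who can be paired with that student. There are 455 such combinations: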

> nrow(t(combn(1:15, 3)))
[1] 455

Now 12 students remain. Again, let's assign the first ungrouped student to group C, which leaves 165 combinations of that student with the other 11:

> nrow(t(combn(1:11, 3)))
[1] 165

We have 8 students left, 7 of whom can be paired with the first ungrouped student into group D in 35 ways:

> nrow(t(combn(1:7, 3)))
[1] 35

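Then, once the other groups have been determined, only one group of four students is left, three of whom are paired with the first ungrouped student: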

> nrow(t(combn(1:3, 3)))
[1] 1

Multiplying these together gives about 2.546 billion groupings:

> 969*455*165*35*1
[1] 2546168625
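The same counting argument can be wrapped in a small function for any number of students and any group size. This is a sketch only: `count_groupings` is a hypothetical helper name, and it assumes the number of students divides evenly into groups.

```r
# Hypothetical helper: number of ways to split n students into unordered
# groups of size k, using the "first ungrouped student anchors the next
# group" argument from this answer.
count_groupings <- function(n, k) {
  stopifnot(n %% k == 0)
  remaining <- seq(n, k, by = -k)      # students left before each group is formed
  prod(choose(remaining - 1, k - 1))   # pick k - 1 partners for each anchor
}

count_groupings(20, 4)   # 969 * 455 * 165 * 35 * 1
count_groupings(16, 4)
```

For 20 students in groups of 4, this multiplies the same factors as the line above, so it can serve as a quick cross-check before attempting to enumerate anything.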

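Here's an in-progress function to produce the grouping for any arbitrary sequence number.

1) [in progress] Convert the sequence number to a vector describing which combination # should be used for groups A, B, C, D, and E. For instance, this should convert #1 to c(1, 1, 1, 1, 1) and #2,546,168,625 to c(969, 455, 165, 35, 1).

2) Convert the combination numbers to a specific output describing the students in each group.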

groupings <- function(seq_nums, students = 20, group_size = 4) {
  n_groups <- students / group_size
  grouped <- NULL
  remaining <- seq_len(students)
  seq_nums_pad <- c(seq_nums, 1) # Last group always uses the only possible combination
  for (g in seq_len(n_groups)) {
    # The first remaining student anchors the group; the other members are
    # picked from the rest using the requested combination number.
    group_relative <-
      c(1, 1 + t(combn(1:(length(remaining) - 1), group_size - 1))[seq_nums_pad[g], ])
    group <- remaining[group_relative]
    print(group)
    grouped <- c(grouped, group)
    remaining <- setdiff(remaining, grouped)
  }
}

> groupings(c(1,1,1,1))
#[1] 1 2 3 4
#[1] 5 6 7 8
#[1]  9 10 11 12
#[1] 13 14 15 16
#[1] 17 18 19 20
> groupings(c(1,1,1,2))
#[1] 1 2 3 4
#[1] 5 6 7 8
#[1]  9 10 11 12
#[1] 13 14 15 17
#[1] 16 18 19 20
> groupings(c(969, 455, 165, 35))   # This one uses the last possibility for
#[1]  1 18 19 20                    #   each grouping.
#[1]  2 15 16 17
#[1]  3 12 13 14
#[1]  4  9 10 11
#[1] 5 6 7 8
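Step 1, which is marked as in progress above, could be handled with a mixed-radix conversion. The following is only a sketch under stated assumptions: `seq_to_indices` is a hypothetical helper, the last non-trivial group (D) is treated as the fastest-varying digit, and the result has four elements because `groupings()` pads the final 1 itself.

```r
# Hypothetical helper: map a sequence number in 1..2546168625 to the
# combination numbers for groups A-D (group E always uses combination 1).
seq_to_indices <- function(n, radices = c(969, 455, 165, 35)) {
  stopifnot(n >= 1, n <= prod(radices))
  r <- n - 1                            # switch to zero-based counting
  idx <- numeric(length(radices))
  for (i in rev(seq_along(radices))) {  # peel off digits from the last group
    idx[i] <- r %% radices[i] + 1
    r <- r %/% radices[i]
  }
  idx
}

seq_to_indices(1)            # 1 1 1 1
seq_to_indices(2)            # 1 1 1 2
seq_to_indices(2546168625)   # 969 455 165 35
```

The output can be passed straight to `groupings()`, for example `groupings(seq_to_indices(2))`. Sequence numbers above about 2.1 billion exceed R's integer range, so the helper works with doubles throughout.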


Col*_*ole 5

This relies heavily on this answer:

Algorithm that can create all combinations and all groups of those combinations

One thing to note is that the linked answer is not very dynamic: it only includes a solution for groups of 3. To make it more robust, we can generate the code from the input parameters. For example, the following recursive function is what gets created on the fly for groups of 3:

group <- function(input, step){
 len <- length(input) 
 combination[1, step] <<- input[1] 

 for (i1 in 2:(len-1)) { 
   combination[2, step] <<- input[i1] 

   for (i2 in (i1+1):(len-0)) { 
     combination[3, step] <<- input[i2] 

     if (step == m) { 
       print(z); result[z, ,] <<- combination 
       z <<- z+1 
     } else { 
       rest <- setdiff(input, input[c(i1,i2, 1)]) 
       group(rest, step +1) #recursive if there are still additional possibilities
   }} 
 } 
}

This takes around 55 seconds to run for N = 16 and k = 4. I'd like to translate it into Rcpp but unfortunately I do not have that skillset.

group_N <- function(input, k = 2) {
  N = length(input)
  m = N/k
  combos <- factorial(N) / (factorial(k)^m * factorial(m))

  result <- array(NA_integer_, dim = c(combos, m, k))
  combination = matrix(NA_integer_, nrow = k, ncol = m)

  z = 1

  group_f_start = 'group <- function(input, step){\n len <- length(input) \n combination[1,  step] <<- input[1] \n '
  i_s <- paste0('i', seq_len(k-1))

  group_f_fors = paste0('for (', i_s, ' in ', c('2', if (length(i_s) != 1) {paste0('(', i_s[-length(i_s)], '+1)')}), ':(len-', rev(seq_len(k)[-k])-1, ')) { \n combination[', seq_len(k)[-1], ', step] <<- input[', i_s, '] \n', collapse = '\n ')

  group_f_inner = paste0('if (step == m) { \n result[z, ,] <<- combination \n z <<- z+1 \n } else { \n rest <- setdiff(input, input[c(',
                         paste0(i_s, collapse = ','),
                         ', 1)]) \n group(rest, step +1) \n }')

  eval(parse(text = paste0(group_f_start, group_f_fors, group_f_inner, paste0(rep('}', times = k), collapse = ' \n '))))

  group(input, 1)
  return(result)
}
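For comparison, the same recursion can be written without generating code through `eval(parse())`. This is a sketch of an alternative, not the method used in this answer: `enumerate_groupings` is a hypothetical name, it returns one row per grouping rather than a 3D array, and it makes no attempt to be fast.

```r
# Hypothetical alternative: the first element anchors a group, its k - 1
# partners are chosen with combn(), and the remaining elements are grouped
# recursively. Each row of the result is one complete grouping.
enumerate_groupings <- function(input, k) {
  stopifnot(k >= 2, length(input) %% k == 0)
  if (length(input) == k) return(matrix(input, nrow = 1))
  rest  <- input[-1]
  mates <- combn(length(rest), k - 1)   # index sets into `rest`
  blocks <- lapply(seq_len(ncol(mates)), function(j) {
    idx  <- mates[, j]
    tail <- enumerate_groupings(rest[-idx], k)
    head <- matrix(c(input[1], rest[idx]),
                   nrow = nrow(tail), ncol = k, byrow = TRUE)
    cbind(head, tail)
  })
  do.call(rbind, blocks)
}

enumerate_groupings(1:4, 2)
# rows: 1 2 3 4 / 1 3 2 4 / 1 4 2 3
```

Columns 1 to k hold the first group, the next k columns the second group, and so on, which matches the layout asked for in the question. It is only practical for small inputs.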

Performance

system.time({test_1 <- group_N(seq_len(4), 2)})
#   user  system elapsed 
#   0.01    0.00    0.02
library(data.table)

#this funky step is just to better show the groups. the provided
## array is fine.

as.data.table(t(rbindlist(as.data.table(apply(test_1, c(1,3), list)))))
#    V1  V2
#1: 1,2 3,4
#2: 1,3 2,4
#3: 1,4 2,3

system.time({test_1 <- group_N(seq_len(16), 4)})
#   user  system elapsed 
#  55.00    0.19   55.29 

as.data.table(t(rbindlist(as.data.table(apply(test_1, c(1,3), list)))))
#very slow
#                  V1          V2          V3          V4
#      1:     1,2,3,4     5,6,7,8  9,10,11,12 13,14,15,16
#      2:     1,2,3,4     5,6,7,8  9,10,11,13 12,14,15,16
#      3:     1,2,3,4     5,6,7,8  9,10,11,14 12,13,15,16
#      4:     1,2,3,4     5,6,7,8  9,10,11,15 12,13,14,16
#      5:     1,2,3,4     5,6,7,8  9,10,11,16 12,13,14,15
#     ---                                                
#2627621:  1,14,15,16  2,11,12,13  3, 6, 9,10     4,5,7,8
#2627622:  1,14,15,16  2,11,12,13     3,7,8,9  4, 5, 6,10
#2627623:  1,14,15,16  2,11,12,13  3, 7, 8,10     4,5,6,9
#2627624:  1,14,15,16  2,11,12,13  3, 7, 9,10     4,5,6,8
#2627625:  1,14,15,16  2,11,12,13  3, 8, 9,10     4,5,6,7


Jos*_*ood 5

This is currently implemented in the development version of RcppAlgos* and will be in the next official release on CRAN.

devtools::install_github("jwood000/RcppAlgos")
a <- comboGroups(10, numGroups = 2)

dim(a)
[1] 126   5   2

a[1,,]
     Grp1 Grp2
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

a[126,,]
     Grp1 Grp2
[1,]    1    2
[2,]    7    3
[3,]    8    4
[4,]    9    5
[5,]   10    6

Or if you prefer matrices:

a1 <- comboGroups(10, 2, retType = "matrix")

head(a1)
     Grp1 Grp1 Grp1 Grp1 Grp1 Grp2 Grp2 Grp2 Grp2 Grp2
[1,]    1    2    3    4    5    6    7    8    9   10
[2,]    1    2    3    4    6    5    7    8    9   10
[3,]    1    2    3    4    7    5    6    8    9   10
[4,]    1    2    3    4    8    5    6    7    9   10
[5,]    1    2    3    4    9    5    6    7    8   10
[6,]    1    2    3    4   10    5    6    7    8    9

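It is also really fast. You can even generate in parallel with nThreads or Parallel = TRUE (the latter uses one less than the system's maximum number of threads) for greater efficiency: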

comboGroupsCount(16, 4)
[1] 2627625

system.time(comboGroups(16, 4, "matrix"))
 user  system elapsed 
0.113   0.033   0.146

system.time(comboGroups(16, 4, "matrix", nThreads = 8))
 user  system elapsed 
0.182   0.179   0.047
                                ## 7 threads on my machine
system.time(comboGroups(16, 4, "matrix", Parallel = TRUE))
 user  system elapsed 
0.176   0.175   0.053

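A really nice feature is the ability to generate samples or specific lexicographical combination groups, especially when the number of results is large.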

comboGroupsCount(factor(state.abb), numGroups = 10)
Big Integer ('bigz') :
[1] 13536281554808237495608549953475109376

mySamp <- comboGroupsSample(factor(state.abb), 
                            numGroups = 10, n = 5, seed = 42)

mySamp[1,,]
     Grp1 Grp2 Grp3 Grp4 Grp5 Grp6 Grp7 Grp8 Grp9 Grp10
[1,] AL   AK   AR   CA   CO   CT   DE   FL   LA   MD   
[2,] IA   AZ   ME   ID   GA   OR   IL   IN   MS   NM   
[3,] KY   ND   MO   MI   HI   PA   MN   KS   MT   OH   
[4,] TX   RI   SC   NH   NV   WI   NE   MA   NY   TN  
[5,] VA   VT   UT   OK   NJ   WY   WA   NC   SD   WV   
50 Levels: AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH ... WY

firstAndLast <- comboGroupsSample(state.abb, 10,
                                  sampleVec = c("1",
                                                "13536281554808237495608549953475109376"))

firstAndLast[1,,]
     Grp1 Grp2 Grp3 Grp4 Grp5 Grp6 Grp7 Grp8 Grp9 Grp10
[1,] "AL" "CO" "HI" "KS" "MA" "MT" "NM" "OK" "SD" "VA" 
[2,] "AK" "CT" "ID" "KY" "MI" "NE" "NY" "OR" "TN" "WA" 
[3,] "AZ" "DE" "IL" "LA" "MN" "NV" "NC" "PA" "TX" "WV" 
[4,] "AR" "FL" "IN" "ME" "MS" "NH" "ND" "RI" "UT" "WI" 
[5,] "CA" "GA" "IA" "MD" "MO" "NJ" "OH" "SC" "VT" "WY"

firstAndLast[2,,]
     Grp1 Grp2 Grp3 Grp4 Grp5 Grp6 Grp7 Grp8 Grp9 Grp10
[1,] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" 
[2,] "WA" "TX" "RI" "OH" "NM" "NE" "MN" "ME" "IA" "HI" 
[3,] "WV" "UT" "SC" "OK" "NY" "NV" "MS" "MD" "KS" "ID" 
[4,] "WI" "VT" "SD" "OR" "NC" "NH" "MO" "MA" "KY" "IL" 
[5,] "WY" "VA" "TN" "PA" "ND" "NJ" "MT" "MI" "LA" "IN"

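Finally, generating all 2,546,168,625 ways of partitioning 20 people into 5 groups (what the OP asked for) can be done in under a minute using the lower and upper arguments: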

system.time(aPar <- parallel::mclapply(seq(1, 2546168625, 969969), function(x) {
     combs <- comboGroups(20, 5, lower = x, upper = x + 969968)
     ### do something
     dim(combs)
}, mc.cores = 4))
   user  system elapsed 
100.432  17.562  39.378

sum(sapply(aPar, "[", 1))
[1] 2546168625
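The chunking in the call above relies on the chunk size dividing the total evenly, so that every `lower`/`upper` pair stays within range. The base R sketch below checks this without calling RcppAlgos; the `pmin()` guard is an added assumption for chunk sizes that do not divide the total, and the variable names are illustrative.

```r
total <- 2546168625
chunk <- 969969

lower <- seq(1, total, by = chunk)
upper <- pmin(lower + chunk - 1, total)   # guard for a partial last chunk

length(lower)                                   # number of chunks: 2625
all(lower[-1] == upper[-length(upper)] + 1)     # chunks are contiguous
sum(upper - lower + 1) == total                 # and cover every grouping once
```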

Although I began working on this problem over a year ago, this question was a huge inspiration for formalizing it in a package.

* I am the author of RcppAlgos


Mon*_*onk -1

The code below gives all unique combinations of 4 chosen from 20, with no duplicates.

x <- c(1:20)
combinations <- data.frame(t(combn(x, 4)))

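  • This gives a list of the possible combinations for a single group of 4 students (x1, x2, x3, and x4). Now, for each row listed, what are the possibilities for the other 4 groups of 4 students? So there should be 20 columns (Group1_1:4, Group2_1:4, Group3_1:4, Group4_1:4, Group5_1:4). Let me know if that can be solved. (3 upvotes)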