R: using data.table to get within-group combinations (as input for igraph)

wya*_*att 1 r igraph data.table

I have a data.table like this:

dt<-data.table(group=(c(rep("A", 4), rep("B", 3), rep("C", 2))),
       individual=c("Ava", "Bill", "Claire", "Daniel", "Evelyn", "Francis", "Grant", "Helen", "Ig"))

I would like to transform it into something like this:

dt2<-data.table(group=(c(rep("A", 6), rep("B", 3), rep("C", 1))), edge1=c("Ava", "Ava", "Ava", "Bill", "Bill", "Claire", "Evelyn", "Evelyn", "Francis", "Helen"), edge2=c("Bill", "Claire", "Daniel", "Claire", "Daniel", "Daniel", "Francis", "Grant", "Grant", "Ig"))

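Basically, each row of the second table holds one pair of individuals taken from the same group in the first table. The whole idea is to feed the data into igraph for network analysis. If there is a better solution for that purpose, it is very welcome.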

Mic*_*ico 7

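Thanks to @mt1022, who helped highlight that the base R implementation of combn is very slow (it is implemented in R). We can therefore borrow from this Q&A about speeding up combn to make this approach more efficient. I could not get gRbase installed on my machine, so I took the code from comb2.int and dropped it into my approach: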

dt[ , {
  edge1 = rep(1:.N, (.N:1) - 1L)
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  o = cumsum(c(0, (.N-2L):1))
  edge2 = i - o[edge1]
  .(edge1 = edge1, edge2 = edge2)
}, by = group]
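Note that the snippet above returns positions within each group (1 to .N), not names. A minimal sketch of mapping those positions back to the individual column, using the OP's small table; the result name edges is only an illustrative choice:

```r
library(data.table)

dt <- data.table(
  group = c(rep("A", 4), rep("B", 3), rep("C", 2)),
  individual = c("Ava", "Bill", "Claire", "Daniel", "Evelyn",
                 "Francis", "Grant", "Helen", "Ig")
)

edges <- dt[ , {
  # position of the first member of each pair: 1 repeated .N-1 times, 2 repeated .N-2 times, ...
  edge1 = rep(1:.N, (.N:1) - 1L)
  # running counter over all pairs, shifted by one
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  # per-edge1 offset that turns the counter into the second position
  o = cumsum(c(0, (.N - 2L):1))
  edge2 = i - o[edge1]
  # look the positions up in this group's individuals
  .(edge1 = individual[edge1], edge2 = individual[edge2])
}, by = group]

edges
```

Each group of size .N contributes .N * (.N - 1) / 2 rows, so the nine individuals above yield 6 + 3 + 1 = 10 edges.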

This is much faster on a scaled-up version of the OP's data set:

max_g = 1e3
dt = data.table(
  group = rep(LETTERS, sample(max_g, 26, TRUE))
)
dt[ , individual := as.character(.I)]

library(microbenchmark)
microbenchmark(
  times = 10L,
  combn = dt[ , transpose(combn(individual, 2, simplify = FALSE)), by = group],
  cj = dt[ , CJ(edge1 = individual, edge2 = individual), by = group
           ][edge1 < edge2],
  fast_combn = dt[ , {
    edge1 = rep(1:.N, (.N:1) - 1L)
    i = 2L:(.N * (.N - 1L) / 2L + 1L)
    o = cumsum(c(0, (.N-2L):1))
    edge2 = i - o[edge1]
    .(edge1 = edge1, edge2 = edge2)
  }, by = group]
)
# Unit: milliseconds
#        expr       min        lq     mean    median        uq       max neval
#       combn 3075.8078 3247.8300 3905.831 3482.9950 4289.8168 6180.1138    10
#          cj 2495.1798 2549.1552 3830.492 4014.6591 4959.2004 5239.7905    10
#  fast_combn  180.1348  217.9098  294.235  284.8854  329.5982  493.4744    10

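In other words, while the original combn approach and the suggested CJ approach come out roughly even, depending on the characteristics of the data, this approach is far better on large data.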


Original approach with just combn

We can use combn like so:

dt2 = dt[ , transpose(combn(individual, 2, simplify = FALSE)), by = group]

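By default, combn returns a 2 x n matrix, where n = choose(.N, 2) and .N is the size of each group.

simplify = FALSE instead returns a length-n list of 2-tuples; transpose converts this (efficiently) into a length-2 list of n-tuples.

Then fix the names: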
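The three shapes described above can be inspected on a single group. This is only a small illustration on a four-element vector; the variable names m, l and tl are made up for the example:

```r
library(data.table)

x <- c("Ava", "Bill", "Claire", "Daniel")

m  <- combn(x, 2)                    # default: 2 x choose(4, 2) character matrix
l  <- combn(x, 2, simplify = FALSE)  # list of choose(4, 2) length-2 vectors
tl <- transpose(l)                   # list of 2 vectors, each of length choose(4, 2)

dim(m)
length(l)
lengths(tl)
```

The transposed form is what data.table needs in j: a list with one element per output column.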

setnames(dt2, c('V1', 'V2'), c('edge1', 'edge2'))
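Since the stated goal is igraph input, here is a hedged sketch of the last step. It assumes the igraph package is installed and that an undirected graph is wanted (the question does not say); graph_from_data_frame() reads the first two columns as the edge list and treats the remaining columns as edge attributes:

```r
library(data.table)
library(igraph)

dt <- data.table(
  group = c(rep("A", 4), rep("B", 3), rep("C", 2)),
  individual = c("Ava", "Bill", "Claire", "Daniel", "Evelyn",
                 "Francis", "Grant", "Helen", "Ig")
)

dt2 <- dt[ , transpose(combn(individual, 2, simplify = FALSE)), by = group]
setnames(dt2, c("V1", "V2"), c("edge1", "edge2"))

# put the two endpoint columns first; group is carried along as an edge attribute
g <- graph_from_data_frame(dt2[ , .(edge1, edge2, group)], directed = FALSE)

vcount(g)  # number of vertices
ecount(g)  # number of edges
```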


mt1*_*022 5

You can achieve this with CJ:

dt[, CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2]
#     group   edge1   edge2
#  1:     A     Ava    Bill
#  2:     A     Ava  Claire
#  3:     A     Ava  Daniel
#  4:     A    Bill  Claire
#  5:     A    Bill  Daniel
#  6:     A  Claire  Daniel
#  7:     B  Evelyn Francis
#  8:     B  Evelyn   Grant
#  9:     B Francis   Grant
# 10:     C   Helen      Ig

Discussion

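As MichaelChirico pointed out, this requires more memory. For a group of size n, CJ creates n^2 rows, whereas combn creates n(n-1)/2 rows. The ratio is n^2 / (n(n-1)/2) = 2n/(n-1), which approaches 2 for large n.

For an approach that is more efficient in both memory and speed, see fast_combn in MichaelChirico's answer.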
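The row counts can be checked directly on one group. A small sketch, assuming a single group of n = 100 individuals with unique names (the filter edge1 < edge2 relies on that uniqueness); the names cj and kept are illustrative:

```r
library(data.table)

n <- 100L
x <- as.character(seq_len(n))

cj   <- CJ(edge1 = x, edge2 = x)  # full cross join: n^2 rows
kept <- cj[edge1 < edge2]         # one row per unordered pair: n(n-1)/2 rows

nrow(cj)
nrow(kept)
nrow(cj) / nrow(kept)             # equals 2n / (n - 1)
```

So roughly half of what CJ materialises is thrown away by the filter, plus the n self-pairs.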


Edit

Added an Rcpp implementation of combn enumeration:

library(Rcpp)
cppFunction(
    'List combnCpp(CharacterVector x) {
    const int n = x.size();
    x.sort();
    CharacterVector combn1 = CharacterVector(n*(n-1)/2);
    CharacterVector combn2 = CharacterVector(n*(n-1)/2);
    int idx = 0;
    for(int i = 0; i < n - 1; i++) {
        for(int j = i + 1; j < n; j++){
            combn1[idx] = x[i];
            combn2[idx] = x[j];
            idx++;
        }
    }
    return List::create(_["V1"] = combn1, _["V2"] = combn2);
}')

# store the result under a different name so the combnCpp function is not overwritten
dt2_cpp = dt[ , combnCpp(individual), by = group]

Here is a benchmark using @MichaelChirico's code:

library(data.table)
max_g = 1e3
set.seed(123)
dt = data.table(
    group = rep(LETTERS, sample(max_g, 26, TRUE))
)
dt[ , individual := as.character(.I)]

library(gRbase)
library(microbenchmark)
microbenchmark(
    times = 10L,
    cpp_combn = dt[ , combnCpp(individual), by = group],
    gRbase = dt[ , transpose(combnPrim(individual, 2, simplify = FALSE)), by = group],
    CJ = dt[ , CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2],
    fast_combn = dt[ , {
        edge1 = rep(1:.N, (.N:1) - 1L)
        i = 2L:(.N * (.N - 1L) / 2L + 1L)
        o = cumsum(c(0, (.N-2L):1))
        edge2 = i - o[edge1]
        .(edge1 = edge1, edge2 = edge2)
    }, by = group]
)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#   cpp_combn  247.6795  284.3614  324.2149  305.1760  347.1372  499.9442    10
#      gRbase 1115.0338 1299.2865 1341.3890 1339.3950 1378.6571 1517.2534    10
#          CJ 1455.2715 1481.8725 1630.0190 1616.7780 1754.3922 1879.5768    10
#  fast_combn  128.5774  153.4234  215.5325  166.7491  319.1567  363.3657    10

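combnCpp is still about 2x slower than fast_combn, which may be because combnCpp enumerates the pairs while fast_combn computes them. A possible improvement to combnCpp would be to compute the indices the way fast_combn does rather than enumerating them.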
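To make the "computing rather than enumerating" point concrete, the index arithmetic used by fast_combn can be pulled out and compared with combn. This is only a sketch: pair_index is a hypothetical helper name, and it assumes n >= 2:

```r
# Hypothetical helper: all index pairs (i < j) for n items, obtained with
# vectorised arithmetic instead of a nested loop.
pair_index <- function(n) {
  edge1 <- rep(seq_len(n), (n:1) - 1L)   # 1 repeated n-1 times, 2 repeated n-2 times, ...
  i <- 2L:(n * (n - 1L) / 2L + 1L)       # running counter over all pairs, shifted by one
  o <- cumsum(c(0, (n - 2L):1))          # offset to subtract for each value of edge1
  list(edge1 = edge1, edge2 = as.integer(i - o[edge1]))
}

# Compare against base R's enumeration for one size
p   <- pair_index(5L)
ref <- combn(5L, 2L)
all(p$edge1 == ref[1, ]) && all(p$edge2 == ref[2, ])
```

The pairs come out in the same order as combn, so the two results can be compared element by element.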