I have a data.table like this:
dt <- data.table(
  group = c(rep("A", 4), rep("B", 3), rep("C", 2)),
  individual = c("Ava", "Bill", "Claire", "Daniel", "Evelyn", "Francis", "Grant", "Helen", "Ig")
)
I would like to transform it into something like this:
dt2 <- data.table(
  group = c(rep("A", 6), rep("B", 3), rep("C", 1)),
  edge1 = c("Ava", "Ava", "Ava", "Bill", "Bill", "Claire", "Evelyn", "Evelyn", "Francis", "Helen"),
  edge2 = c("Bill", "Claire", "Daniel", "Claire", "Daniel", "Daniel", "Francis", "Grant", "Grant", "Ig")
)
Basically, each row of the second table pairs up two individuals from the same group in the first table. The whole point is to feed the data into igraph for network analysis; if there is a better solution for that purpose, it is very welcome.
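For the igraph step itself, here is a minimal sketch of what consuming dt2 could look like (my assumption, not part of the question; graph_from_data_frame reads the first two columns as the edge list and keeps any remaining columns, here group, as edge attributes):

library(igraph)
g <- graph_from_data_frame(dt2[ , .(edge1, edge2, group)], directed = FALSE)
plot(g)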
Thanks to @mt1022, who helped point out that combn as implemented in base R is very slow (it is implemented in R). We can therefore take a cue from this Q&A on speeding up combn to make this approach more efficient. I couldn't install gRbase on my machine, so I took the code for comb2.int from there and dropped it into my approach:
dt[ , {
  # node i is paired with every later node, so repeat it .N - i times
  edge1 = rep(1:.N, (.N:1) - 1L)
  # running positions 2, 3, ..., choose(.N, 2) + 1
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  # cumulative offsets so that edge2 restarts at edge1 + 1 for each new edge1
  o = cumsum(c(0, (.N - 2L):1))
  edge2 = i - o[edge1]
  .(edge1 = edge1, edge2 = edge2)
}, by = group]
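Note that the j-expression above returns integer positions within each group rather than the names in individual; if the character labels are wanted (as in dt2), one way, a small sketch built on the same code, is to subset individual by those positions:

dt[ , {
  edge1 = rep(1:.N, (.N:1) - 1L)
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  o = cumsum(c(0, (.N - 2L):1))
  edge2 = i - o[edge1]
  .(edge1 = individual[edge1], edge2 = individual[edge2])
}, by = group]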
This speeds things up considerably on a scaled-up version of the OP's data set:
max_g = 1e3
dt = data.table(
  group = rep(LETTERS, sample(max_g, 26, TRUE))
)
dt[ , individual := as.character(.I)]

library(microbenchmark)
microbenchmark(
  times = 10L,
  combn = dt[ , transpose(combn(individual, 2, simplify = FALSE)), by = group],
  cj = dt[ , CJ(edge1 = individual, edge2 = individual), by = group
           ][edge1 < edge2],
  fast_combn = dt[ , {
    edge1 = rep(1:.N, (.N:1) - 1L)
    i = 2L:(.N * (.N - 1L) / 2L + 1L)
    o = cumsum(c(0, (.N - 2L):1))
    edge2 = i - o[edge1]
    .(edge1 = edge1, edge2 = edge2)
  }, by = group]
)
# Unit: milliseconds
#        expr       min        lq     mean    median        uq       max neval
#       combn 3075.8078 3247.8300 3905.831 3482.9950 4289.8168 6180.1138    10
#          cj 2495.1798 2549.1552 3830.492 4014.6591 4959.2004 5239.7905    10
#  fast_combn  180.1348  217.9098  294.235  284.8854  329.5982  493.4744    10
That said, while the original combn approach and the suggested CJ approach trade places depending on the data's characteristics, this approach is far better on large data.
We can use combn like this:
dt2 = dt[ , transpose(combn(individual, 2, simplify = FALSE)), by = group]
By default, combn returns a 2 x n matrix, where n = choose(.N, 2) and .N is the size of each group.
simplify = FALSE instead returns a length-n list of 2-tuples; transpose converts this to a length-2 list of n-tuples (efficiently).
Then fix the names:
setnames(dt2, c('V1', 'V2'), c('edge1', 'edge2'))
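To make the combn/transpose step concrete, here is what it does for three of group A's members (a small illustration, not part of the original answer):

x = c("Ava", "Bill", "Claire")
combn(x, 2)                    # 2 x 3 matrix, one column per pair
combn(x, 2, simplify = FALSE)  # length-3 list: ("Ava","Bill"), ("Ava","Claire"), ("Bill","Claire")
transpose(combn(x, 2, simplify = FALSE))
# length-2 list: c("Ava", "Ava", "Bill") and c("Bill", "Claire", "Claire"),
# which become the V1/V2 columns in the grouped call above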
You can achieve the same with CJ like this:
dt[ , CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2]
#     group   edge1   edge2
#  1:     A     Ava    Bill
#  2:     A     Ava  Claire
#  3:     A     Ava  Daniel
#  4:     A    Bill  Claire
#  5:     A    Bill  Daniel
#  6:     A  Claire  Daniel
#  7:     B  Evelyn Francis
#  8:     B  Evelyn   Grant
#  9:     B Francis   Grant
# 10:     C   Helen      Ig
As MichaelChirico pointed out, this needs much more memory. For a group of size n, CJ creates n^2 rows before filtering, while combn creates only n(n-1)/2; the ratio is n^2 / (n(n-1)/2) = 2n/(n-1) ≈ 2 (for n = 1000, that is 1,000,000 rows versus 499,500).
For an approach that is more efficient in both memory and speed, see fast_combn in MichaelChirico's answer.
Here is an Rcpp implementation of the combn-style enumeration:
library(Rcpp)
cppFunction(
  'List combnCpp(CharacterVector x) {
    const int n = x.size();
    x.sort();
    CharacterVector combn1 = CharacterVector(n * (n - 1) / 2);
    CharacterVector combn2 = CharacterVector(n * (n - 1) / 2);
    int idx = 0;
    for (int i = 0; i < n - 1; i++) {
      for (int j = i + 1; j < n; j++) {
        combn1[idx] = x[i];
        combn2[idx] = x[j];
        idx++;
      }
    }
    return List::create(_["V1"] = combn1, _["V2"] = combn2);
  }')
combnCpp = dt[ , combnCpp(individual), by = group]
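As with the base-R combn result, the columns of this result come back named V1 and V2 (plus group), so the same renaming as before would apply if edge1/edge2 names are wanted:

setnames(combnCpp, c('V1', 'V2'), c('edge1', 'edge2'))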
Here is a benchmark using @MichaelChirico's code:
library(data.table)
max_g = 1e3
set.seed(123)
dt = data.table(
  group = rep(LETTERS, sample(max_g, 26, TRUE))
)
dt[ , individual := as.character(.I)]

library(gRbase)
library(microbenchmark)
microbenchmark(
  times = 10L,
  cpp_combn = dt[ , combnCpp(individual), by = group],
  gRbase = dt[ , transpose(combnPrim(individual, 2, simplify = FALSE)), by = group],
  CJ = dt[ , CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2],
  fast_combn = dt[ , {
    edge1 = rep(1:.N, (.N:1) - 1L)
    i = 2L:(.N * (.N - 1L) / 2L + 1L)
    o = cumsum(c(0, (.N - 2L):1))
    edge2 = i - o[edge1]
    .(edge1 = edge1, edge2 = edge2)
  }, by = group]
)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#   cpp_combn  247.6795  284.3614  324.2149  305.1760  347.1372  499.9442    10
#      gRbase 1115.0338 1299.2865 1341.3890 1339.3950 1378.6571 1517.2534    10
#          CJ 1455.2715 1481.8725 1630.0190 1616.7780 1754.3922 1879.5768    10
#  fast_combn  128.5774  153.4234  215.5325  166.7491  319.1567  363.3657    10
combnCpp is still about 2x slower than fast_combn, most likely because combnCpp enumerates the pairs while fast_combn computes them. A possible improvement to combnCpp would be to compute the indices the way fast_combn does instead of enumerating them.
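One direction for that (a rough, untested sketch with a hypothetical combnIdxCpp, not code from either answer) is to have C++ return only integer index pairs and do the character subsetting once in R, so the inner loop never touches the strings; it still enumerates, but only over cheap integers:

library(Rcpp)
cppFunction('List combnIdxCpp(const int n) {
  const int m = n * (n - 1) / 2;
  IntegerVector idx1(m), idx2(m);
  int k = 0;
  // enumerate 1-based index pairs (i, j) with i < j
  for (int i = 1; i < n; i++) {
    for (int j = i + 1; j <= n; j++) {
      idx1[k] = i;
      idx2[k] = j;
      k++;
    }
  }
  return List::create(_["i"] = idx1, _["j"] = idx2);
}')

dt[ , {
  idx = combnIdxCpp(.N)
  .(edge1 = individual[idx$i], edge2 = individual[idx$j])
}, by = group]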