Per*_*nkf 11 r data.table
我正在尝试定义一个函数,该函数将数据框或表作为具有特定数量的 ID 列(例如,2 或 3 个 ID 列)的输入,其余列是 NAME1、NAME2、...、NAMEK(数字列) )。输出应该是一个数据表,其中包含与之前相同的 ID 列以及一个附加 ID 列,该 ID 列对列名称的每个唯一的成对组合进行分组(NAME1、NAME2、...)。另外,我们必须根据ID列将数字列的实际值收集到两个新列中;具有两个 ID 列和三个数字列的示例:
ID1 <- c("A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3)
NAME1 <- c(10,11,9,22,25,22)
NAME2 <- c(7,9,8,20,22,21)
NAME3 <- c(10,12,11,15,19,30)
DT <- data.table(ID1,ID2,NAME1,NAME2,NAME3)
Run Code Online (Sandbox Code Playgroud)
我希望以 DT 作为输入的函数的输出为
ID.new <- c("NAME1 - NAME2","NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME2",
"NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME3", "NAME1 - NAME3",
"NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3",
"NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3",
"NAME2 - NAME3", "NAME2 - NAME3")
ID1 <- c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
value.left <- c(10,11,9,22,25,22,10,11,9,22,25,22,7,9,8,20,22,21)
value.right <- c(7,9,8,20,22,21,10,12,11,15,19,30,10,12,11,15,19,30)
DT.output <- data.table(ID.new,ID1,ID2,value.left,value.right)
Run Code Online (Sandbox Code Playgroud)
我发现 fun() (见下文)可以完成这项工作,但对于我来说太慢了:
fun <- function(data, ID.cols){
data <- data.table(data)
# Which of the columns are ID columns
ids <- which(colnames(data) %in% ID.cols)
# Obtain all pairwise combinations of numeric columns into a list
numeric.combs <- combn(x = data.table(data)[,!ids, with = FALSE], m = 2, simplify = FALSE)
id.cols <- data[,ids, with = FALSE]
# bind the ID columns to each pairwise combination of numeric columns inside the list
bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols,x))
# Create generic names for the numeric columns so that rbindlist() may be applied. In addition we make a new column that groups based on which columns we are considering
generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x)
setattr(x = x[,ID.NEW:=paste(colnames(x[,!ids,with=FALSE]),collapse=" - ")], name =
'names', value = c(ID.cols,"value.left","value.right","ID.NEW"))))
return(rbindlist(l=generalize))
}
# Performance
print(microbenchmark(fun(DT,ID.cols=c("ID1","ID2")),times=1000))
Run Code Online (Sandbox Code Playgroud)
有没有更快、更优雅的方法来做到这一点?
熔化的自连接选项:
library(data.table)
DTlong <- melt(DT, id.vars = c("ID1", "ID2"), variable.factor = FALSE)
out <- DTlong[DTlong, on = .(ID1, ID2), allow.cartesian = TRUE
][variable < i.variable,
][, .(ID.new = paste(variable, i.variable, sep = " - "),
ID1, ID2, value.left = value, value.right = i.value)]
out
# ID.new ID1 ID2 value.left value.right
# <char> <char> <num> <num> <num>
# 1: NAME1 - NAME2 A 1 10 7
# 2: NAME1 - NAME2 A 2 11 9
# 3: NAME1 - NAME2 A 3 9 8
# 4: NAME1 - NAME2 B 1 22 20
# 5: NAME1 - NAME2 B 2 25 22
# 6: NAME1 - NAME2 B 3 22 21
# 7: NAME1 - NAME3 A 1 10 10
# 8: NAME2 - NAME3 A 1 7 10
# 9: NAME1 - NAME3 A 2 11 12
# 10: NAME2 - NAME3 A 2 9 12
# 11: NAME1 - NAME3 A 3 9 11
# 12: NAME2 - NAME3 A 3 8 11
# 13: NAME1 - NAME3 B 1 22 15
# 14: NAME2 - NAME3 B 1 20 15
# 15: NAME1 - NAME3 B 2 25 19
# 16: NAME2 - NAME3 B 2 22 19
# 17: NAME1 - NAME3 B 3 22 30
# 18: NAME2 - NAME3 B 3 21 30
### validation
setorder(out, ID.new, ID1, ID2)
identical(DT.output, out)
# [1] TRUE
Run Code Online (Sandbox Code Playgroud)
当然,这种方法combn是合理的想法,但它唯一的低效率是每个组合都会迭代一次。也就是说,combn(..., FUN=)在本例中,传递给的函数被调用了 18 次;如果你的数据更大,它会被调用很多次。然而,在此处的 /join 的情况下merge,一切都是以我们可以管理的矢量化方式完成的:merge高效地完成,过滤作为单个逻辑向量返回,并且也是paste(..)一个大向量。
公平地说,合并概念确实有其自身的低效率:由于笛卡尔连接,它最初生成 54 行。对于更大的数据,这将导致内存耗尽问题。如果您遇到这种情况,使用fuzzyjoin和包含variable < variable(LHS 与 RHS)可能会有所帮助,这应该可以减少(如果不能完全消除)问题。
最后的建议也可以通过以下方式完成sqldf:
sqldf::sqldf("
select t1.variable || ' - ' || t2.variable as [ID.new], t1.ID1, t1.ID2,
t1.value as [value.left], t2.value as [value.right]
from DTlong t1
join DTlong t2 on t1.ID1=t2.ID1 and t1.ID2=t2.ID2
and t1.variable < t2.variable")
# ID.new ID1 ID2 value.left value.right
# 1 NAME1 - NAME2 A 1 10 7
# 2 NAME1 - NAME3 A 1 10 10
# 3 NAME1 - NAME2 A 2 11 9
# 4 NAME1 - NAME3 A 2 11 12
# 5 NAME1 - NAME2 A 3 9 8
# 6 NAME1 - NAME3 A 3 9 11
# 7 NAME1 - NAME2 B 1 22 20
# 8 NAME1 - NAME3 B 1 22 15
# 9 NAME1 - NAME2 B 2 25 22
# 10 NAME1 - NAME3 B 2 25 19
# 11 NAME1 - NAME2 B 3 22 21
# 12 NAME1 - NAME3 B 3 22 30
# 13 NAME2 - NAME3 A 1 7 10
# 14 NAME2 - NAME3 A 2 9 12
# 15 NAME2 - NAME3 A 3 8 11
# 16 NAME2 - NAME3 B 1 20 15
# 17 NAME2 - NAME3 B 2 22 19
# 18 NAME2 - NAME3 B 3 21 30
Run Code Online (Sandbox Code Playgroud)
基准测试:
bench::mark(
pernkf = fun(DT, c("ID1", "ID2")),
tjebo = fun2(DT, c("ID1", "ID2")),
r2evans = fun3(DT, c("ID1", "ID2")), # native data.table
r2evans2 = fun4(), # sqldf
check = FALSE)
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 pernkf 5.38ms 6.06ms 161. 287KB 13.2 61 5 379ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 2 tjebo 5.08ms 5.63ms 172. 230KB 8.83 78 4 453ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 3 r2evans 2.97ms 3.48ms 280. 170KB 11.0 127 5 454ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 4 r2evans2 17.19ms 18.91ms 52.0 145KB 13.0 20 5 384ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
Run Code Online (Sandbox Code Playgroud)
(sqldf在这个例子中确实会影响性能,我欢迎改进查询:-)
如果您可以使用数据框,下面将为您提供当前速度最快且内存效率最高的方法(请参阅基准维基)。
我认为使用的方法combn()对我来说似乎是合理的。而且我真的不认为它会像所声称的那样迭代组合 18 次。而且,我个人发现这比数据表熔化版本更容易阅读,但这可能是因为我不习惯 data.table 语法。
注意:在数据表上使用它显然效率不高。如果您确实需要 data.table,r2evans 解决方案更好。
fun2 <- function(data, ID.cols){
ids <- which(colnames(data) %in% ID.cols)
## you can loop over the combinations directly
new_dat <- combn(data[-ids], 2, function(x) {
new_x <- setNames(x, paste("value", c("left", "right"), sep = "."))
## use paste with collapse for the ID.new
new_x$ID.new <- paste(names(x), collapse = " - ")
new_x
}, simplify = FALSE)
## bind it with the old ID columns, outside the loop (bit faster)
cbind(do.call(rbind, new_dat), data[ids])
}
fun2(DT,ID.cols = c("ID1", "ID2"))
#> value.left value.right ID.new ID1 ID2
#> 1 10 7 NAME1 - NAME2 A 1
#> 2 11 9 NAME1 - NAME2 A 2
#> 3 9 8 NAME1 - NAME2 A 3
#> 4 22 20 NAME1 - NAME2 B 1
#> 5 25 22 NAME1 - NAME2 B 2
#> 6 22 21 NAME1 - NAME2 B 3
#> 7 10 10 NAME1 - NAME3 A 1
#> 8 11 12 NAME1 - NAME3 A 2
#> 9 9 11 NAME1 - NAME3 A 3
#> 10 22 15 NAME1 - NAME3 B 1
#> 11 25 19 NAME1 - NAME3 B 2
#> 12 22 30 NAME1 - NAME3 B 3
#> 13 7 10 NAME2 - NAME3 A 1
#> 14 9 12 NAME2 - NAME3 A 2
#> 15 8 11 NAME2 - NAME3 A 3
#> 16 20 15 NAME2 - NAME3 B 1
#> 17 22 19 NAME2 - NAME3 B 2
#> 18 21 30 NAME2 - NAME3 B 3
Run Code Online (Sandbox Code Playgroud)
有关基准测试,请参阅社区 wiki。