如何根据列名子集的成对组合创建新的数据表?

Per*_*nkf 11 r data.table

我正在尝试定义一个函数,该函数将数据框或表作为具有特定数量的 ID 列(例如,2 或 3 个 ID 列)的输入,其余列是 NAME1、NAME2、...、NAMEK(数字列) )。输出应该是一个数据表,其中包含与之前相同的 ID 列以及一个附加 ID 列,该 ID 列对列名称的每个唯一的成对组合进行分组(NAME1、NAME2、...)。另外,我们必须根据ID列将数字列的实际值收集到两个新列中;具有两个 ID 列和三个数字列的示例:

ID1 <- c("A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3)
NAME1 <- c(10,11,9,22,25,22)
NAME2 <- c(7,9,8,20,22,21)
NAME3 <- c(10,12,11,15,19,30)
DT <- data.table(ID1,ID2,NAME1,NAME2,NAME3)
Run Code Online (Sandbox Code Playgroud)

我希望以 DT 作为输入的函数的输出为

ID.new <- c("NAME1 - NAME2","NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME2",
 "NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME3", "NAME1 - NAME3",
 "NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3",
 "NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3",
 "NAME2 - NAME3", "NAME2 - NAME3")
ID1 <- c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
value.left <- c(10,11,9,22,25,22,10,11,9,22,25,22,7,9,8,20,22,21)
value.right <- c(7,9,8,20,22,21,10,12,11,15,19,30,10,12,11,15,19,30)
DT.output <- data.table(ID.new,ID1,ID2,value.left,value.right)
Run Code Online (Sandbox Code Playgroud)

我发现 fun() (见下文)可以完成这项工作,但对于我来说太慢了:

  fun <- function(data, ID.cols){
   data <- data.table(data)
   # Which of the columns are ID columns
   ids <-  which(colnames(data) %in% ID.cols)
   # Obtain all pairwise combinations of numeric columns into a list
   numeric.combs <- combn(x = data.table(data)[,!ids, with = FALSE], m = 2, simplify = FALSE)
   id.cols <- data[,ids, with = FALSE]
   # bind the ID columns to each pairwise combination of numeric columns inside the list
   bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols,x)) 
   # Create generic names for the numeric columns so that rbindlist() may be applied. In addition we make a new column that groups based on which columns we are considering
   generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x) 
   setattr(x = x[,ID.NEW:=paste(colnames(x[,!ids,with=FALSE]),collapse=" - ")], name = 
   'names', value = c(ID.cols,"value.left","value.right","ID.NEW"))))
   return(rbindlist(l=generalize))
}

# Performance
print(microbenchmark(fun(DT,ID.cols=c("ID1","ID2")),times=1000))
Run Code Online (Sandbox Code Playgroud)

有没有更快、更优雅的方法来做到这一点?

r2e*_*ans 7

熔化的自连接选项:

library(data.table)
DTlong <- melt(DT, id.vars = c("ID1", "ID2"), variable.factor = FALSE)
out <- DTlong[DTlong, on = .(ID1, ID2), allow.cartesian = TRUE
  ][variable < i.variable,
  ][, .(ID.new = paste(variable, i.variable, sep = " - "),
        ID1, ID2, value.left = value, value.right = i.value)]
out
#            ID.new    ID1   ID2 value.left value.right
#            <char> <char> <num>      <num>       <num>
#  1: NAME1 - NAME2      A     1         10           7
#  2: NAME1 - NAME2      A     2         11           9
#  3: NAME1 - NAME2      A     3          9           8
#  4: NAME1 - NAME2      B     1         22          20
#  5: NAME1 - NAME2      B     2         25          22
#  6: NAME1 - NAME2      B     3         22          21
#  7: NAME1 - NAME3      A     1         10          10
#  8: NAME2 - NAME3      A     1          7          10
#  9: NAME1 - NAME3      A     2         11          12
# 10: NAME2 - NAME3      A     2          9          12
# 11: NAME1 - NAME3      A     3          9          11
# 12: NAME2 - NAME3      A     3          8          11
# 13: NAME1 - NAME3      B     1         22          15
# 14: NAME2 - NAME3      B     1         20          15
# 15: NAME1 - NAME3      B     2         25          19
# 16: NAME2 - NAME3      B     2         22          19
# 17: NAME1 - NAME3      B     3         22          30
# 18: NAME2 - NAME3      B     3         21          30

### validation
setorder(out, ID.new, ID1, ID2)
identical(DT.output, out)
# [1] TRUE
Run Code Online (Sandbox Code Playgroud)

当然,这种方法combn是合理的想法,但它唯一的低效率是每个组合都会迭代一次。也就是说,combn(..., FUN=)在本例中,传递给的函数被调用了 18 次;如果你的数据更大,它会被调用很多次。然而,在此处的 /join 的情况下merge,一切都是以我们可以管理的矢量化方式完成的:merge高效地完成,过滤作为单个逻辑向量返回,并且也是paste(..)一个大向量。

公平地说,合并概念确实有其自身的低效率:由于笛卡尔连接,它最初生成 54 行。对于更大的数据,这将导致内存耗尽问题。如果您遇到这种情况,使用fuzzyjoin和包含variable < variable(LHS 与 RHS)可能会有所帮助,这应该可以减少(如果不能完全消除)问题。

最后的建议也可以通过以下方式完成sqldf

sqldf::sqldf("
  select t1.variable || ' - ' || t2.variable as [ID.new], t1.ID1, t1.ID2, 
    t1.value as [value.left], t2.value as [value.right]
  from DTlong t1
    join DTlong t2 on t1.ID1=t2.ID1 and t1.ID2=t2.ID2
      and t1.variable < t2.variable")
#           ID.new ID1 ID2 value.left value.right
# 1  NAME1 - NAME2   A   1         10           7
# 2  NAME1 - NAME3   A   1         10          10
# 3  NAME1 - NAME2   A   2         11           9
# 4  NAME1 - NAME3   A   2         11          12
# 5  NAME1 - NAME2   A   3          9           8
# 6  NAME1 - NAME3   A   3          9          11
# 7  NAME1 - NAME2   B   1         22          20
# 8  NAME1 - NAME3   B   1         22          15
# 9  NAME1 - NAME2   B   2         25          22
# 10 NAME1 - NAME3   B   2         25          19
# 11 NAME1 - NAME2   B   3         22          21
# 12 NAME1 - NAME3   B   3         22          30
# 13 NAME2 - NAME3   A   1          7          10
# 14 NAME2 - NAME3   A   2          9          12
# 15 NAME2 - NAME3   A   3          8          11
# 16 NAME2 - NAME3   B   1         20          15
# 17 NAME2 - NAME3   B   2         22          19
# 18 NAME2 - NAME3   B   3         21          30
Run Code Online (Sandbox Code Playgroud)

基准测试:

bench::mark(
  pernkf  = fun(DT, c("ID1", "ID2")),
  tjebo   = fun2(DT, c("ID1", "ID2")),
  r2evans = fun3(DT, c("ID1", "ID2")),  # native data.table
  r2evans2 = fun4(),                    # sqldf
  check = FALSE)
# # A tibble: 4 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time     gc        
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>   <list>    
# 1 pernkf       5.38ms   6.06ms     161.      287KB    13.2     61     5      379ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 2 tjebo        5.08ms   5.63ms     172.      230KB     8.83    78     4      453ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 3 r2evans      2.97ms   3.48ms     280.      170KB    11.0    127     5      454ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 4 r2evans2    17.19ms  18.91ms      52.0     145KB    13.0     20     5      384ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
Run Code Online (Sandbox Code Playgroud)

sqldf在这个例子中确实会影响性能,我欢迎改进查询:-)


Tje*_*ebo 3

如果您可以使用数据框,下面将为您提供当前速度最快且内存效率最高的方法(请参阅基准维基)。

我认为使用的方法combn()对我来说似乎是合理的。而且我真的不认为它会像所声称的那样迭代组合 18 次。而且,我个人发现这比数据表熔化版本更容易阅读,但这可能是因为我不习惯 data.table 语法。

注意:在数据表上使用它显然效率不高。如果您确实需要 data.table,r2evans 解决方案更好。

fun2 <- function(data, ID.cols){
  ids <-  which(colnames(data) %in% ID.cols)
  ## you can loop over the combinations directly
  new_dat <- combn(data[-ids], 2, function(x) {
    new_x <- setNames(x, paste("value", c("left", "right"), sep = "."))
    ## use paste with collapse for the ID.new
    new_x$ID.new <- paste(names(x), collapse = " - ")
    new_x
  }, simplify = FALSE)

## bind it with the old ID columns, outside the loop (bit faster)
  cbind(do.call(rbind, new_dat), data[ids])
}

fun2(DT,ID.cols = c("ID1", "ID2"))
#>    value.left value.right        ID.new ID1 ID2
#> 1          10           7 NAME1 - NAME2   A   1
#> 2          11           9 NAME1 - NAME2   A   2
#> 3           9           8 NAME1 - NAME2   A   3
#> 4          22          20 NAME1 - NAME2   B   1
#> 5          25          22 NAME1 - NAME2   B   2
#> 6          22          21 NAME1 - NAME2   B   3
#> 7          10          10 NAME1 - NAME3   A   1
#> 8          11          12 NAME1 - NAME3   A   2
#> 9           9          11 NAME1 - NAME3   A   3
#> 10         22          15 NAME1 - NAME3   B   1
#> 11         25          19 NAME1 - NAME3   B   2
#> 12         22          30 NAME1 - NAME3   B   3
#> 13          7          10 NAME2 - NAME3   A   1
#> 14          9          12 NAME2 - NAME3   A   2
#> 15          8          11 NAME2 - NAME3   A   3
#> 16         20          15 NAME2 - NAME3   B   1
#> 17         22          19 NAME2 - NAME3   B   2
#> 18         21          30 NAME2 - NAME3   B   3
Run Code Online (Sandbox Code Playgroud)

有关基准测试,请参阅社区 wiki