Aggregate a matrix (or data.frame) by column name groups in R

Question

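I have a large matrix of about 3000 columns x 3000 rows. For each row, I want to aggregate (compute the mean of) the values grouped by column name. The columns are named along these lines, in random order: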

 Tree Tree House House Tree Car Car House


I need the result (the per-row means, aggregated by group) to have the following columns:

  Tree House Car


The tricky part (at least for me) is that I do not know all the column names in advance, and they appear in random order!

Answer 1

akr*_*run 5

You could try

res1 <- vapply(unique(colnames(m1)), function(x) 
      rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                             numeric(nrow(m1)) )


Or

res2 <-  sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )

identical(res1,res2)
#[1] TRUE

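To make the behaviour of the `vapply`/`rowMeans` approach concrete, here is a minimal sketch on a hypothetical 2 x 5 toy matrix (the matrix `m` and its values are made up for illustration and are not part of the original answer). Each group name selects its matching columns, and `rowMeans` averages them row by row.

```r
# Hypothetical toy matrix: 2 rows, 5 columns, repeated column names in mixed order.
# matrix() fills column by column, so each pair below is one column.
m <- matrix(c(1, 2,    # Tree
              3, 4,    # House
              5, 6,    # Tree
              7, 8,    # Car
              9, 10),  # House
            nrow = 2,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "House")))

# Group names in order of first appearance: "Tree" "House" "Car"
groups <- unique(colnames(m))

# For each group, pick the matching columns and average them per row.
# drop = FALSE keeps a single-column selection (here "Car") as a matrix,
# which rowMeans() requires.
res <- vapply(groups, function(g)
  rowMeans(m[, colnames(m) == g, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m)))

res
#      Tree House Car
# [1,]    3     6   7
# [2,]    4     7   8
```

The group names never have to be known in advance, because `unique(colnames(m))` discovers them from the data.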

Another option would be to reshape to long form and then aggregate

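 library(data.table)
 library(reshape2)   # melt() for a matrix comes from reshape2
 # melt() gives long-form columns Var1 (row index), Var2 (column name), value
 res3 <-dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,Var1:= NULL]
 identical(res1, as.matrix(res3))
 #[1] TRUE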

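The same "go long, then aggregate" idea can also be sketched in base R without any packages. This is an illustrative alternative rather than the answer's method: the toy matrix `m` and the helper names `row_id` and `col_grp` are assumptions made for the example.

```r
# Hypothetical toy matrix (same layout as a matrix with repeated column names)
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
            nrow = 2,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "House")))

# Long form in spirit: every cell gets its row index and the name of its column.
row_id  <- as.vector(row(m))
col_grp <- colnames(m)[as.vector(col(m))]

# Average all cells that share a (row, column-name) pair.
# tapply() sorts the group labels, so the columns come out alphabetically.
res_long <- tapply(as.vector(m), list(row_id, col_grp), mean, na.rm = TRUE)
res_long
#   Car House Tree
# 1   7     6    3
# 2   8     7    4

# Reorder the columns to order of first appearance if that matters
res_long <- res_long[, unique(colnames(m)), drop = FALSE]
```

Note the difference in column order compared with the `unique(colnames(m))` based approaches, which is why the last line reorders explicitly.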

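Benchmarks

For a 3000*3000 matrix, the first two approaches perform about the same, and both are roughly 4.5 to 5 times faster than the reshaping approach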

set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE), 
   ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
    3000,replace=TRUE)))

library(microbenchmark)

f1 <-function() {vapply(unique(colnames(m1)), function(x) 
     rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                           numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}

f3 <- function() {dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,
            Var1:= NULL]}

microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
#   Unit: relative
# expr      min       lq     mean   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004    10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531    10

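If `microbenchmark` is not installed, a rough comparison can be sketched with base R's `system.time`. The matrix size (500 x 500), the repetition count (20), and the names `by_vapply` and `by_sapply` are arbitrary choices for this illustration. The timings depend on the machine, so no expected numbers are shown.

```r
set.seed(24)
# Smaller than the 3000 x 3000 case so the sketch runs quickly (arbitrary size)
n <- 500
m <- matrix(sample(0:40, n * n, replace = TRUE), ncol = n,
            dimnames = list(NULL,
                            sample(c("Tree", "House", "Car"), n, replace = TRUE)))

by_vapply <- function() vapply(unique(colnames(m)), function(x)
  rowMeans(m[, colnames(m) == x, drop = FALSE], na.rm = TRUE),
  numeric(nrow(m)))

by_sapply <- function() sapply(unique(colnames(m)), function(x)
  rowMeans(m[, colnames(m) == x, drop = FALSE], na.rm = TRUE))

# Confirm the candidates agree before timing them
stopifnot(identical(by_vapply(), by_sapply()))

# Elapsed seconds for 20 repetitions of each candidate
timings <- c(
  vapply = system.time(for (i in 1:20) by_vapply())[["elapsed"]],
  sapply = system.time(for (i in 1:20) by_sapply())[["elapsed"]]
)
timings
```

Unlike `microbenchmark`, this gives a single coarse measurement per candidate, so small differences between the two should not be over-interpreted.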

Data

 set.seed(24)
 m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10, 
     dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))

