Replacing a loop in a function with vectorized computation to speed up R

col*_*lin 4 r

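Suppose I have some data in a data frame d1 that describes how frequently individuals in different samples eat different foods, with a final column that flags whether each food is cool. The data is structured as follows.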

OTU.ID<- c('pizza','taco','pizza.taco','dirt')
s1<-c(5,20,14,70)
s2<-c(99,2,29,5)
s3<-c(44,44,33,22)
cool<-c(1,1,1,0)

d1<-data.frame(OTU.ID,s1,s2,s3,cool)
print(d1)
      OTU.ID s1 s2 s3 cool
1      pizza  5 99 44    1
2       taco 20  2 44    1
3 pizza.taco 14 29 33    1
4       dirt 70  5 22    0

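I wrote a function that computes, for each sample s1:s3, the number of cool foods consumed and the total number of foods consumed. It runs as a for loop over the sample columns of the data table (which is very slow).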

cool.food.abundance<- function(food.table){
    samps<-colnames(food.table)
    #remove column names that are not sample names
    samps<-samps[!samps %in% c("OTU.ID","cool")]

    #create output vectors for the for loop
    id<-c()
    cool.foods<-c()
    all.foods<-c()
    #run a loop that stores output ids and results as vectors
    for(i in seq_along(samps)){
        x<- samps[i]
        y1<-sum(food.table[samps[i]]*food.table$cool)
        y2<-sum(food.table[samps[i]])
        id<-c(id,x)
        cool.foods<-c(cool.foods,y1)
        all.foods<-c(all.foods,y2)
    }
    #save results as a data frame and return the data frame object
    results<-data.frame(id,cool.foods,all.foods)
    return(results)
}

So if you run this function, you get a new table with the sample ID, the number of cool foods sampled, and the total number of foods sampled.

cool.food.abundance(d1)
  id cool.foods all.foods
1 s1         39       109
2 s2        130       135
3 s3        121       143

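How can I replace this for loop with vectorized computation to speed it up? I would really like the function to be able to operate on data frames loaded with the fread function from the data.table package.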

akr*_*run 5

You can try

library(data.table)#v1.9.5+
dcast(melt(setDT(d1), id.var=c('OTU.ID', 'cool'))[,
         sum(value) ,.(cool, variable)], variable~c('notcool.foods',
       'cool.foods')[cool+1L], value.var='V1')[,
    all.foods:= cool.foods+notcool.foods][, notcool.foods:=NULL]
#      variable cool.foods all.foods
#1:       s1         39       109
#2:       s2        130       135
#3:       s3        121       143

Or, without using dcast, we can summarise the result directly (as in @jeremycg's post), since there are only two groups

 melt(setDT(d1), id.var=c('OTU.ID', 'cool'), variable.name='id')[,
     list(all.foods=sum(value), cool.foods=sum(value[cool==1])) , id]
 #   id all.foods cool.foods
 #1: s1       109         39
 #2: s2       135        130
 #3: s3       143        121
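
The melt step above reshapes the wide table into one row per (food, sample) pair, so that each sample becomes a group to sum over. The same idea can be sketched in base R with stack() and tapply(); this is only an illustration of the reshaping, not the data.table code itself, and the names samps, long, all.foods and cool.foods are chosen here for the example.

```r
# Example data from the question
d1 <- data.frame(
  OTU.ID = c('pizza', 'taco', 'pizza.taco', 'dirt'),
  s1     = c(5, 20, 14, 70),
  s2     = c(99, 2, 29, 5),
  s3     = c(44, 44, 33, 22),
  cool   = c(1, 1, 1, 0)
)

samps <- c('s1', 's2', 's3')

# stack() returns a long data frame with columns `values` (the counts)
# and `ind` (the name of the sample column each count came from)
long <- stack(d1[samps])

# stack() appends the columns one after another, so repeating the cool
# flag once per sample lines it up with the stacked rows
long$cool <- rep(d1$cool, times = length(samps))

# Sum within each sample; multiplying by the 0/1 flag keeps only cool foods
all.foods  <- tapply(long$values, long$ind, sum)
cool.foods <- tapply(long$values * long$cool, long$ind, sum)

data.frame(id = names(all.foods),
           all.foods  = as.vector(all.foods),
           cool.foods = as.vector(cool.foods))
```

For large tables the data.table version is likely the better choice, since the grouping is done in optimized code; the sketch is mainly useful for seeing what the long format looks like.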

Or you can use base R

nm1 <- paste0('s', 1:3)
res <- t(addmargins(rowsum(as.matrix(d1[nm1]), group=d1$cool),1)[-1,])

colnames(res) <- c('cool.foods', 'all.foods')
res
 #   cool.foods all.foods
 #s1         39       109
 #s2        130       135
 #s3        121       143
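
If the goal is a drop-in replacement for the original function, one possible sketch uses colSums() on the sample columns. The function name cool.food.abundance.vec is hypothetical, and the sketch assumes that cool is a 0/1 flag and that every column other than OTU.ID and cool is a sample. The as.data.frame() call is there so that a data.table returned by fread can be passed in as well.

```r
# Hypothetical vectorized replacement for cool.food.abundance();
# the name is an assumption for this example, not from the original post.
cool.food.abundance.vec <- function(food.table) {
  # Accept either a data.frame or a data.table (e.g. from fread)
  food.table <- as.data.frame(food.table)

  # Every column except the ID and the cool flag is treated as a sample
  samps  <- setdiff(colnames(food.table), c("OTU.ID", "cool"))
  counts <- as.matrix(food.table[samps])

  data.frame(
    id         = samps,
    # The cool vector is recycled down each column, so rows belonging
    # to foods with cool == 0 contribute nothing to the column sums
    cool.foods = unname(colSums(counts * food.table$cool)),
    all.foods  = unname(colSums(counts))
  )
}

# Example data from the question
d1 <- data.frame(
  OTU.ID = c('pizza', 'taco', 'pizza.taco', 'dirt'),
  s1     = c(5, 20, 14, 70),
  s2     = c(99, 2, 29, 5),
  s3     = c(44, 44, 33, 22),
  cool   = c(1, 1, 1, 0)
)

res.vec <- cool.food.abundance.vec(d1)
res.vec
```

On the example data this should reproduce the table returned by the loop version. No timings are claimed here; how much faster it is will depend on the number of sample columns.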