Mai*_*ura 12 r plyr data.table
I have a dataset whose header looks like this:
PID Time Site Rep Count
I want to sum Count by Rep for each PID x Time x Site combination. On the resulting data.frame, I then want the mean of Count for each PID x Time x Site combination.
The current function is as follows:
dummy <- function(data)
{
  A <- aggregate(Count ~ PID + Time + Site + Rep, data = data, function(x) {sum(na.omit(x))})
  B <- aggregate(Count ~ PID + Time + Site, data = A, mean)
  return(B)
}
This is very slow (the original data.frame is 510000 x 20). Is there a way to speed this up with plyr?
Ram*_*ath 22
You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like this:
library(data.table)
data_t = data.table(data_tab)  # data_tab is the original data.frame
ans = data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
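
The call above computes the sum and the mean of Count directly within each PID x Time x Site group. If the goal is the two-step result described in the question (total per Rep first, then the mean of those totals), one way to sketch it is with a chained data.table call. The toy data and the names toy, dt and two_step are made up for illustration; na.rm = TRUE is an assumption that mirrors the na.omit() in the original function.

```r
library(data.table)

# Toy data with the question's column names (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  PID   = rep(c("a", "b"), each = 8),
  Time  = rep(rep(c("t1", "t2"), each = 4), times = 2),
  Site  = "s1",
  Rep   = rep(rep(1:2, each = 2), times = 4),
  Count = rpois(16, 100)
)

dt <- as.data.table(toy)

# Step 1: total Count within each PID x Time x Site x Rep.
# Step 2: mean of those totals within each PID x Time x Site.
two_step <- dt[, .(total = sum(Count, na.rm = TRUE)), by = .(PID, Time, Site, Rep)
               ][, .(Count = mean(total)), by = .(PID, Time, Site)]

print(two_step)
```

Chaining keeps the intermediate per-Rep totals out of the workspace while still grouping twice, as the aggregate() version does.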
Let's see how fast data.table is and how that compares with dplyr. This is roughly how you would do it with dplyr.
data %>% group_by(PID, Time, Site, Rep) %>%
summarise(totalCount = sum(Count)) %>%
group_by(PID, Time, Site) %>%
summarise(mean(totalCount))
Or perhaps this, depending on the exact interpretation of the question:
data %>% group_by(PID, Time, Site) %>%
  summarise(totalCount = sum(Count), meanCount = mean(Count))
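Here is a full example of these alternatives compared with the answer proposed by @Ramnath and the one proposed by @David Arenburg in the comments, which I think is equivalent to the second dplyr statement.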
# Simulated data with the same number of rows as the question's data.frame
nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE),
                   Time = sample(letters, nrow, replace = TRUE),
                   Site = sample(letters, nrow, replace = TRUE),
                   Rep = rnorm(nrow),
                   Count = rpois(nrow, 100))
library(dplyr)
library(data.table)

# dplyr, two steps: sum per Rep, then mean of the sums
Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
  summarise(totalCount = sum(Count)) %>%
  group_by(PID, Time, Site) %>%
  summarise(mean(totalCount))
Rprof()
summaryRprof(tf1) # reports 1.68 sec sampling time

# dplyr, one pass: sum and mean (grouped by PID, Time, Site and Rep here)
Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
  summarise(total = sum(Count), meanCount = mean(Count))
Rprof()
summaryRprof(tf2) # reports 1.60 seconds

# data.table, converting with data.table()
Rprof(tf3 <- tempfile())
data_t = data.table(data)
ans = data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3) # reports 0.06 seconds

# data.table, converting in place with setDT()
Rprof(tf4 <- tempfile())
ans <- setDT(data)[, .(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4) # reports 0.02 seconds
The data.table approach is much faster, and setDT is faster still!
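
A likely reason for the gap between the last two timings (an interpretation, not something the profile output itself shows) is that data.table() builds a new object from the data.frame, while setDT() converts the existing object in place without copying its columns. The small sketch below illustrates that difference; df and copy_dt are hypothetical names used only here.

```r
library(data.table)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# data.table() returns a new data.table; the original df stays a plain data.frame.
copy_dt <- data.table(df)
print(is.data.table(df))       # FALSE
print(is.data.table(copy_dt))  # TRUE

# setDT() changes the class of df itself, by reference.
setDT(df)
print(is.data.table(df))       # TRUE
```

One consequence is that setDT(data) in the benchmark leaves data as a data.table afterwards, so any later code that expects a plain data.frame would need setDF(data) or as.data.frame(data).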