How to optimize sapply in R to compute running totals on a data frame

Jon*_*han 3 performance r processing-efficiency dataframe sapply

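I wrote a function in R to compute a running total by month, but as the dataset gets larger, the execution time of my approach grows exponentially. I'm a novice R programmer; can you help me make it more efficient?
The function and how I call it: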

# Sum the measurements for the same subject as row `recordnum`,
# over all months up to and including that row's month.
accumulate <- function(recordnum, df) {
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))
system.time(df$cumulative <- sapply(seq_len(datalength), accumulate, df))

Input data frame:

> df
   measurement subject month
1    0.9148060     dog     5
2    0.9370754     dog     4
3    0.2861395     dog     3
4    0.8304476     dog     2
5    0.6417455     dog     1
6    0.5190959     cat     5
7    0.7365883     cat     4
8    0.1346666     cat     3
9    0.6569923     cat     2
10   0.7050648     cat     1

Output data frame:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

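Note that the cumulative column shows the running total of all measurements up to and including the current month. The function does not require the data frame to be sorted. With datalength = 100 the elapsed time is 0.3 seconds; with 1,000 it is 0.58; with 10,000 it is 27.72. I need this to run on 200K+ records.
Thanks!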

cde*_*man 5

dplyr makes this easy:

library(dplyr)
df %>%
    group_by(subject) %>%
    arrange(month) %>%
    mutate(cumulative = cumsum(measurement))

Source: local data frame [10 x 4]
Groups: subject

   measurement subject month cumulative
1    0.7050648     cat     1  0.7050648
2    0.6569923     cat     2  1.3620571
3    0.1346666     cat     3  1.4967237
4    0.7365883     cat     4  2.2333120
5    0.5190959     cat     5  2.7524079
6    0.6417455     dog     1  0.6417455
7    0.8304476     dog     2  1.4721931
8    0.2861395     dog     3  1.7583327
9    0.9370754     dog     4  2.6954081
10   0.9148060     dog     5  3.6102141
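
The same idea (order by month within each subject, then take a cumulative sum) can also be sketched in base R with `order()` and `ave()`, which additionally keeps the rows in their original order. This is only a sketch under one assumption that is not stated in the question: each subject has at most one row per month. With duplicate months, `cumsum` would give different values from the original `month <=` comparison. The names `ord` and `running` are made up for illustration.

```r
# Rebuild the example data from the question.
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# Row indices sorted by subject, then by month.
ord <- order(df$subject, df$month)

# Cumulative sum of the sorted measurements, restarted for each subject.
running <- ave(df$measurement[ord], df$subject[ord], FUN = cumsum)

# Write the results back so that df keeps its original row order.
df$cumulative <- numeric(nrow(df))
df$cumulative[ord] <- running

print(df)
```

Because this sorts once and then makes a single pass per subject, it avoids rescanning the whole data frame for every row, which is what the `sapply` version does.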

If you are after absolute performance, though, you may want to use data.table:

library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]    

#     measurement subject month cumulative
#  1:   0.7050648     cat     1  0.7050648
#  2:   0.6569923     cat     2  1.3620571
#  3:   0.1346666     cat     3  1.4967237
#  4:   0.7365883     cat     4  2.2333120
#  5:   0.5190959     cat     5  2.7524079
#  6:   0.6417455     dog     1  0.6417455
#  7:   0.8304476     dog     2  1.4721931
#  8:   0.2861395     dog     3  1.7583327
#  9:   0.9370754     dog     4  2.6954081
# 10:   0.9148060     dog     5  3.6102141
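
One thing worth knowing about the call above: as far as I understand data.table's `:=`, it adds the column by reference and does not reorder the rows of `df`, so a sorted listing like the one shown needs a separate step. A small sketch using the question's example data (the name `sorted_view` is made up for illustration):

```r
library(data.table)

# Rebuild the example data from the question.
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# Convert in place, then add the running total by reference.
# order(month) only controls the order in which cumsum sees the rows.
setDT(df)[order(month), cumulative := cumsum(measurement), by = subject]

# df itself is still in its original row order (dog 5..1, then cat 5..1).
# Build a sorted copy for display.
sorted_view <- df[order(subject, month)]
print(sorted_view)
```

If a sorted table is what you actually want to keep, `setorder(df, subject, month)` sorts `df` by reference instead of creating a copy.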

  • I would use `setDT()` rather than `as.data.table()`, which makes an unnecessary copy. (3 upvotes)