Jon*_*han 3 performance r processing-efficiency dataframe sapply
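I wrote a function in R to compute cumulative totals by month number, but the execution time of my method grows exponentially as the data set gets larger. I'm a novice R programmer; can you help me make it more efficient?
The function and the way I call it: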
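# Sum every measurement for the same subject up to and including this row's month.
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))
# Each call scans the whole data frame, so the total work grows with the square of the row count.
system.time(df$cumulative <- sapply(seq_len(datalength), accumulate, df))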
Input data frame:
> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1
Output data frame:
> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648
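Note that the cumulative column holds the running total of all measurements up to and including the current month. The function does not require the data frame to be sorted. With datalength equal to 100, the elapsed time is 0.3 seconds; with 1,000 it is 0.58 seconds; with 10,000 it is 27.72 seconds. I need this to run on 200K+ records.
Thanks!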
dplyr makes this easy:
library(dplyr)
df %>%
  group_by(subject) %>%
  arrange(month) %>%
  mutate(cumulative = cumsum(measurement))
Source: local data frame [10 x 4]
Groups: subject
   measurement subject month cumulative
1    0.7050648     cat     1  0.7050648
2    0.6569923     cat     2  1.3620571
3    0.1346666     cat     3  1.4967237
4    0.7365883     cat     4  2.2333120
5    0.5190959     cat     5  2.7524079
6    0.6417455     dog     1  0.6417455
7    0.8304476     dog     2  1.4721931
8    0.2861395     dog     3  1.7583327
9    0.9370754     dog     4  2.6954081
10   0.9148060     dog     5  3.6102141
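If it helps to see the sort-then-cumsum idea without any packages, here is a minimal base R sketch of the same approach. It is only an illustration, not part of the answer above: it assumes each subject has at most one row per month, and it writes the totals back in the original row order so they can be compared directly with the `sapply` result from the question.

```r
# Original approach from the question, kept only for comparison.
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}

set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# Reference result computed the slow way.
expected <- sapply(seq_len(nrow(df)), accumulate, df)

# Sort-based version: order rows by subject and month, take a cumulative
# sum within each subject, then put the values back in the original order.
ord <- order(df$subject, df$month)
cum_sorted <- ave(df$measurement[ord], df$subject[ord], FUN = cumsum)
df$cumulative <- numeric(nrow(df))
df$cumulative[ord] <- cum_sorted

# TRUE when both approaches agree (up to floating-point tolerance).
isTRUE(all.equal(df$cumulative, expected))
```

The sort costs O(n log n) and the cumulative sum is a single pass, whereas the original function rescans the whole data frame for every row.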
That said, if you are after absolute performance, you may want to use data.table:
library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]
#     measurement subject month cumulative
#  1:   0.7050648     cat     1  0.7050648
#  2:   0.6569923     cat     2  1.3620571
#  3:   0.1346666     cat     3  1.4967237
#  4:   0.7365883     cat     4  2.2333120
#  5:   0.5190959     cat     5  2.7524079
#  6:   0.6417455     dog     1  0.6417455
#  7:   0.8304476     dog     2  1.4721931
#  8:   0.2861395     dog     3  1.7583327
#  9:   0.9370754     dog     4  2.6954081
# 10:   0.9148060     dog     5  3.6102141
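One caveat worth checking against your real data: the question's function sums every row with `month <=` the current month, so if a subject can have several rows for the same month, each of those rows should get the full total through that month. A plain grouped `cumsum` gives tied rows different partial totals instead. The base R sketch below shows one possible way to handle this; the small data frame is hypothetical and made up purely for illustration.

```r
# Hypothetical data: subject "dog" has two records for month 2.
df <- data.frame(measurement = c(1, 2, 3, 4),
                 subject = c("dog", "dog", "dog", "cat"),
                 month = c(1, 2, 2, 1))

# Work on a copy sorted by subject and month.
ord <- order(df$subject, df$month)
s <- df[ord, ]

# Plain grouped cumulative sum: tied rows receive different partial totals.
s$running <- ave(s$measurement, s$subject, FUN = cumsum)

# Take the last running total within each subject/month pair, so every
# tied row reports the full total through that month.
s$cumulative <- ave(s$running, s$subject, s$month,
                    FUN = function(x) x[length(x)])

# Write the totals back in the original row order.
df$cumulative <- numeric(nrow(df))
df$cumulative[ord] <- s$cumulative
df
```

Both "dog" rows for month 2 end up with the same total here, which matches what the original `accumulate` function would return for them. If your data has exactly one row per subject and month, this extra step is unnecessary.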