Mar*_*arl 3 time r apply dataframe lubridate
I have a data frame, for example:
data <- data.frame("date" = c("2015-05-01 14:12:57",
"2015-05-01 14:14:57",
"2015-05-01 14:15:57",
"2015-05-01 14:42:57",
"2015-05-01 14:52:57"),
"Var1" = c(2,3,4,2,1),
"Var2" = c(0.53,0.3,0.34,0.12,0.91),
"Var3" = c(1,1,1,1,1))
data
date Var1 Var2 Var3
1 2015-05-01 14:12:57 2 0.53 1
2 2015-05-01 14:14:57 3 0.30 1
3 2015-05-01 14:15:57 4 0.34 1
4 2015-05-01 14:42:57 2 0.12 1
5 2015-05-01 14:52:57 1 0.91 1
However, the real data has 60,000 rows and 26 variables!
What I want to achieve is:
unix_timestamp Var1 Var2 Var3
1 2015-05-01 14:12:57 2.0 0.530 1
2 2015-05-01 14:14:57 2.5 0.415 2
3 2015-05-01 14:15:57 3.0 0.390 3
4 2015-05-01 14:42:57 2.0 0.120 1
5 2015-05-01 14:52:57 1.5 0.515 2
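In theory: for each row, compute the mean of Var1 and Var2 and the sum of Var3 over the observations from the preceding 15 minutes (including the row itself).
I came up with: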
library(lubridate)
data <- data.frame("date" = c("2015-05-01 14:12:57",
"2015-05-01 14:14:57",
"2015-05-01 14:15:57",
"2015-05-01 14:42:57",
"2015-05-01 14:52:57"),
"Var1" = c(2,3,4,2,1),
"Var2" = c(0.53,0.3,0.34,0.12,0.91),
"Var3" = c(1,1,1,1,1))
pre <- vector("list", nrow(data))
for (i in 1:length(pre)) {
  # to see progress
  print(paste(i, "of", nrow(data), sep = " "))
  help <- data[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - minutes(15)) &
               as.POSIXct(data[,1]) <= as.POSIXct(data[i,1]),] # Help data frame with time frame selection
  chunk <- data.frame("unix_timestamp" = as.POSIXct(data[i,1]),
                      "Var1" = mean(help$Var1),
                      "Var2" = mean(help$Var2),
                      "Var3" = sum(help$Var3))
  pre[[i]] <- chunk
}
output <- do.call(rbind, pre)
output
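...which does return the desired result. However, for a data frame with 60,000 rows this either does not work or would take 100 years (don't forget that I actually have 26 variables).
Does anyone know how to get rid of the loop or how to tune my function? I would really appreciate it! I also tried sapply, but it did not seem to be much faster, or I did something wrong.
Thanks for any help!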
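Using dplyr, we can convert date to class POSIXct, use cut to break it into 15-minute intervals, and then take the cumulative mean and cumulative sum of the columns within each interval.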
library(dplyr)
data %>%
  group_by(group = cut(as.POSIXct(date), breaks = "15 mins")) %>%
  mutate_at(vars(Var1, Var2), cummean) %>%
  mutate_at(vars(Var3), cumsum) %>%
  ungroup() %>%
  select(-group)
# date Var1 Var2 Var3
# <fct> <dbl> <dbl> <dbl>
#1 2015-05-01 14:12:57 2 0.53 1
#2 2015-05-01 14:14:57 2.5 0.415 2
#3 2015-05-01 14:15:57 3 0.39 3
#4 2015-05-01 14:42:57 2 0.12 1
#5 2015-05-01 14:52:57 1.5 0.515 2
mutate_at is used because there are 26 variables, so the same function can be applied to several columns at once.
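In dplyr 1.0.0 and later, mutate_at is superseded by across(). As a minimal sketch (not part of the original answer), the same fixed-interval pipeline might be written as follows. The tz = "UTC" argument and the name result are assumptions added here for illustration.

```r
library(dplyr)

data <- data.frame(
  date = c("2015-05-01 14:12:57", "2015-05-01 14:14:57", "2015-05-01 14:15:57",
           "2015-05-01 14:42:57", "2015-05-01 14:52:57"),
  Var1 = c(2, 3, 4, 2, 1),
  Var2 = c(0.53, 0.3, 0.34, 0.12, 0.91),
  Var3 = c(1, 1, 1, 1, 1)
)

result <- data %>%
  # Fixed 15-minute bins; tz = "UTC" is an assumption to keep parsing reproducible
  group_by(group = cut(as.POSIXct(date, tz = "UTC"), breaks = "15 mins")) %>%
  # Cumulative mean for the averaged columns, cumulative sum for the counted one
  mutate(across(c(Var1, Var2), cummean),
         across(Var3, cumsum)) %>%
  ungroup() %>%
  select(-group)

result
```

With 26 variables, the column selections inside across() can be widened with tidyselect helpers such as starts_with() instead of listing each name.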
EDIT
Following @Rentrop's comment, the answer is updated using his data.
library(dplyr)
library(purrr)
dat %>%
  mutate(date = as.POSIXct(date),
         Var1 = map_dbl(date, ~mean(Var1[date >= (.x - (15 * 60)) & date <= .x])),
         Var2 = map_dbl(date, ~mean(Var2[date >= (.x - (15 * 60)) & date <= .x])),
         Var3 = map_dbl(date, ~sum(Var3[date >= (.x - (15 * 60)) & date <= .x])))
# date Var1 Var2 Var3
#1 2015-05-01 14:12:57 2.0 0.530 1
#2 2015-05-01 14:14:57 2.5 0.415 2
#3 2015-05-01 14:29:57 3.5 0.320 2
#4 2015-05-01 14:42:57 3.0 0.230 2
#5 2015-05-01 14:52:57 1.5 0.515 2
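For 60,000 rows, scanning the whole date column once per row can still be slow. A possible alternative, sketched here in base R under the assumption that the rows are sorted by time, locates each window's boundaries with findInterval and reads the window sums off a cumulative sum. The names secs, lo, hi, n, roll_sum, mean_cols and sum_cols are hypothetical, and tz = "UTC" is an assumption. The lower bound is exclusive, as in the question's loop.

```r
data <- data.frame(
  date = c("2015-05-01 14:12:57", "2015-05-01 14:14:57", "2015-05-01 14:15:57",
           "2015-05-01 14:42:57", "2015-05-01 14:52:57"),
  Var1 = c(2, 3, 4, 2, 1),
  Var2 = c(0.53, 0.3, 0.34, 0.12, 0.91),
  Var3 = c(1, 1, 1, 1, 1)
)

# Timestamps as seconds since the epoch; tz = "UTC" is an assumption
secs <- as.numeric(as.POSIXct(data$date, tz = "UTC"))
stopifnot(!is.unsorted(secs))  # findInterval needs non-decreasing times

lo <- findInterval(secs - 15 * 60, secs)  # rows at or before (time - 15 min): excluded
hi <- findInterval(secs, secs)            # last row with a time <= the current time
n  <- hi - lo                             # number of rows in each window

# Window sum = difference of two cumulative sums
roll_sum <- function(x) {
  cs <- c(0, cumsum(x))
  cs[hi + 1] - cs[lo + 1]
}

mean_cols <- c("Var1", "Var2")  # columns to average (hypothetical selection)
sum_cols  <- "Var3"             # columns to sum (hypothetical selection)

output <- data.frame(unix_timestamp = data$date)
output[mean_cols] <- lapply(data[mean_cols], function(x) roll_sum(x) / n)
output[sum_cols]  <- lapply(data[sum_cols], roll_sum)
output
```

Because each window sum is a difference of two running totals, results can differ from a direct mean() or sum() by floating-point rounding, which may matter for very long or very large-valued columns.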