xia*_*dai 58 r plyr dplyr data.table
I have a data.table:
library(data.table)
set.seed(1)
data <- data.table(time = c(1:3, 1:4),
                   groups = c(rep(c("b", "a"), c(3, 4))),
                   value = rnorm(7))
data
#    time groups      value
# 1:    1      b -0.6264538
# 2:    2      b  0.1836433
# 3:    3      b -0.8356286
# 4:    1      a  1.5952808
# 5:    2      a  0.3295078
# 6:    3      a -0.8204684
# 7:    4      a  0.4874291
I want to compute a lagged version of the "value" column within each level of "groups".
The result should look like this:
# groups time value lag.value
# 1 a 1 1.5952808 NA
# 2 a 2 0.3295078 1.5952808
# 3 a 3 -0.8204684 0.3295078
# 4 a 4 0.4874291 -0.8204684
# 5 b 1 -0.6264538 NA
# 6 b 2 0.1836433 -0.6264538
# 7 b 3 -0.8356286 0.1836433
I tried using lag directly:
data$lag.value <- lag(data$value)
...which obviously doesn't work.
I also tried:
unlist(tapply(data$value, data$groups, lag))
a1 a2 a3 a4 b1 b2 b3
NA -0.1162932 0.4420753 2.1505440 NA 0.5894583 -0.2890288
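This is almost what I want. However, the resulting vector is ordered differently from the rows in the data.table, which is a problem.
What is the most efficient way to do this in base R, plyr, dplyr, and data.table?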
akr*_*run 86
You can do this within data.table:
library(data.table)
data[, lag.value:=c(NA, value[-.N]), by=groups]
data
# time groups value lag.value
#1: 1 a 0.02779005 NA
#2: 2 a 0.88029938 0.02779005
#3: 3 a -1.69514201 0.88029938
#4: 1 b -1.27560288 NA
#5: 2 b -0.65976434 -1.27560288
#6: 3 b -1.37804943 -0.65976434
#7: 4 b 0.12041778 -1.37804943
For multiple columns:
nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
data
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132
From data.table version >= v1.9.5, we can use shift with type set to lag or lead. By default, type is lag.
data[, (nm2) := shift(.SD), by=groups, .SDcols=nm1]
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132
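For a single column, shift can also be called directly on the column rather than on .SD. The following is a minimal sketch under that assumption; the table name dt and the column lag2.value are made up for illustration, and n, fill and type are spelled out even though the first call only repeats shift's defaults.

```r
library(data.table)

set.seed(1)
dt <- data.table(time = c(1:3, 1:4),
                 groups = rep(c("b", "a"), c(3, 4)),
                 value = rnorm(7))

# One-step lag within each group; n = 1, fill = NA and type = "lag"
# are the defaults and are written out only for clarity.
dt[, lag.value := shift(value, n = 1L, fill = NA, type = "lag"), by = groups]

# Two-step lag in the same style (hypothetical extra column).
dt[, lag2.value := shift(value, n = 2L, type = "lag"), by = groups]
dt
```

Because := adds the columns by reference, the row order of dt is left untouched, which avoids the ordering problem seen with tapply in the question.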
If you need the reverse, use type = 'lead':
nm3 <- paste("lead", nm1, sep=".")
Using the original dataset:
data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
# time groups value value1 value2 lead.value lead.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 0.1836433 0.5757814
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.8356286 -0.3053884
#3: 3 b -0.8356286 -0.3053884 -0.01619026 NA NA
#4: 1 a 1.5952808 1.5117812 0.94383621 0.3295078 0.3898432
#5: 2 a 0.3295078 0.3898432 0.82122120 -0.8204684 -0.6212406
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.4874291 -2.2146999
#7: 4 a 0.4874291 -2.2146999 0.91897737 NA NA
# lead.value2
#1: -0.04493361
#2: -0.01619026
#3: NA
#4: 0.82122120
#5: 0.59390132
#6: 0.91897737
#7: NA
Data used for the multi-column examples:
set.seed(1)
data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
value = rnorm(7), value1=rnorm(7), value2=rnorm(7))
Ale*_*lex 68
Using the dplyr package:
library(dplyr)
data <-
data %>%
group_by(groups) %>%
mutate(lag.value = dplyr::lag(value, n = 1, default = NA))
which gives:
> data
Source: local data table [7 x 4]
Groups: groups
time groups value lag.value
1 1 a 0.07614866 NA
2 2 a -0.02784712 0.07614866
3 3 a 1.88612245 -0.02784712
4 1 b 0.26526825 NA
5 2 b 1.23820506 0.26526825
6 3 b 0.09276648 1.23820506
7 4 b -0.09253594 0.09276648
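As @BrianD pointed out, this implicitly assumes that value is already sorted within each group. If it is not, either sort it by group first or use the order_by argument of lag. Also note that, because of an issue in some versions of dplyr, it is safer to name the arguments and give the namespace (dplyr::lag) explicitly.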
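I want to complement the previous answers by describing two ways I handle this problem in an important case: when you cannot guarantee that every group has data for every time period. That is, you still have a regularly spaced time series, but values may be missing here and there. I will focus on two ways to improve the dplyr solution.
We start with the same data you used...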
library(dplyr)
library(tidyr)
set.seed(1)
data_df = data.frame(time = c(1:3, 1:4),
groups = c(rep(c("b", "a"), c(3, 4))),
value = rnorm(7))
data_df
#> time groups value
#> 1 1 b -0.6264538
#> 2 2 b 0.1836433
#> 3 3 b -0.8356286
#> 4 1 a 1.5952808
#> 5 2 a 0.3295078
#> 6 3 a -0.8204684
#> 7 4 a 0.4874291
...but now we delete a couple of rows:
data_df = data_df[-c(2, 6), ]
data_df
#> time groups value
#> 1 1 b -0.6264538
#> 3 3 b -0.8356286
#> 4 1 a 1.5952808
#> 5 2 a 0.3295078
#> 7 4 a 0.4874291
The simple dplyr solution no longer works:
data_df %>%
arrange(groups, time) %>%
group_by(groups) %>%
mutate(lag.value = lag(value)) %>%
ungroup()
#> # A tibble: 5 x 4
#> time groups value lag.value
#> <int> <fct> <dbl> <dbl>
#> 1 1 a 1.60 NA
#> 2 2 a 0.330 1.60
#> 3 4 a 0.487 0.330
#> 4 1 b -0.626 NA
#> 5 3 b -0.836 -0.626
You can see that, although we have no value for the case (group = 'a', time = '3'), the output above still shows a lagged value for the case (group = 'a', time = '4'), which is actually the value at time = 2.
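Correct dplyr solution
The idea is to add the missing (group, time) combinations. This is very memory-inefficient when there are many possible (group, time) combinations but only a sparse subset of them has values.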
dplyr_correct_df = expand.grid(
groups = sort(unique(data_df$groups)),
time = seq(from = min(data_df$time), to = max(data_df$time))
) %>%
left_join(data_df, by = c("groups", "time")) %>%
arrange(groups, time) %>%
group_by(groups) %>%
mutate(lag.value = lag(value)) %>%
ungroup()
dplyr_correct_df
#> # A tibble: 8 x 4
#> groups time value lag.value
#> <fct> <int> <dbl> <dbl>
#> 1 a 1 1.60 NA
#> 2 a 2 0.330 1.60
#> 3 a 3 NA 0.330
#> 4 a 4 0.487 NA
#> 5 b 1 -0.626 NA
#> 6 b 2 NA -0.626
#> 7 b 3 -0.836 NA
#> 8 b 4 NA -0.836
Notice that we now have an NA at (group = 'a', time = '4'), which should be the expected behaviour. The same goes for (group = 'b', time = '3').
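The same fill-in-the-gaps idea can probably be written more compactly with tidyr::complete and tidyr::full_seq instead of expand.grid plus left_join. This is only a sketch of that variant, not part of the solution above; the name completed_df is made up, and it assumes the time periods are exactly one unit apart.

```r
library(dplyr)
library(tidyr)

set.seed(1)
data_df <- data.frame(time = c(1:3, 1:4),
                      groups = rep(c("b", "a"), c(3, 4)),
                      value = rnorm(7))[-c(2, 6), ]

completed_df <- data_df %>%
  # Add a row (with value = NA) for every missing (groups, time) pair;
  # full_seq() builds the full sequence min(time):max(time) in steps of 1.
  complete(groups, time = full_seq(time, period = 1)) %>%
  arrange(groups, time) %>%
  group_by(groups) %>%
  mutate(lag.value = dplyr::lag(value)) %>%
  ungroup()
completed_df
```

Like the expand.grid version, this materialises every (groups, time) combination, so the same memory caveat applies.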
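Correct solution using zoo::zooreg
This solution should be better in terms of memory when the number of cases is very large, because it uses an index instead of filling the missing cases with NAs.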
library(zoo)
zooreg_correct_df = data_df %>%
as_tibble() %>%
# nest the data for each group
# should work for multiple groups variables
nest(-groups, .key = "zoo_ob") %>%
mutate(zoo_ob = lapply(zoo_ob, function(d) {
# create zooreg objects from the individual data.frames created by nest
z = zoo::zooreg(
data = select(d,-time),
order.by = d$time,
frequency = 1
) %>%
# calculate lags
# we also ask for the 0'th order lag so that we keep the original value
zoo:::lag.zooreg(k = (-1):0) # note the sign convention is different
# recover df's from zooreg objects
cbind(
time = as.integer(zoo::index(z)),
zoo:::as.data.frame.zoo(z)
)
})) %>%
unnest() %>%
# format values
select(groups, time, value = value.lag0, lag.value = `value.lag-1`) %>%
arrange(groups, time) %>%
# eliminate additional periods created by lag
filter(time <= max(data_df$time))
zooreg_correct_df
#> # A tibble: 8 x 4
#> groups time value lag.value
#> <fct> <int> <dbl> <dbl>
#> 1 a 1 1.60 NA
#> 2 a 2 0.330 1.60
#> 3 a 3 NA 0.330
#> 4 a 4 0.487 NA
#> 5 b 1 -0.626 NA
#> 6 b 2 NA -0.626
#> 7 b 3 -0.836 NA
#> 8 b 4 NA -0.836
Finally, let's check that the two correct solutions are actually equal:
all.equal(dplyr_correct_df, zooreg_correct_df)
#> [1] TRUE
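Both solutions above return a row for every (group, time) pair. If only the observed rows are needed, a third option (not from the approaches above, just a sketch) is to lag time alongside value and keep the lagged value only when the previous row is exactly one period earlier. The names gap_aware_df and prev_time are made up for this example, and the period length of 1 is an assumption.

```r
library(dplyr)

set.seed(1)
data_df <- data.frame(time = c(1:3, 1:4),
                      groups = rep(c("b", "a"), c(3, 4)),
                      value = rnorm(7))[-c(2, 6), ]

gap_aware_df <- data_df %>%
  arrange(groups, time) %>%
  group_by(groups) %>%
  mutate(
    prev_time = dplyr::lag(time),
    # Accept the previous row's value only if it is exactly one period back;
    # otherwise the true lag is missing, so return NA.
    lag.value = ifelse(!is.na(prev_time) & time - prev_time == 1,
                       dplyr::lag(value), NA_real_)
  ) %>%
  ungroup() %>%
  select(-prev_time)
gap_aware_df
```

This keeps the five original rows and needs no extra memory for padding, at the cost of not showing the periods that are missing.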
小智 5
In base R, this will do the job:
data$lag.value <- c(NA, data$value[-nrow(data)])
data$lag.value[which(!duplicated(data$groups))] <- NA
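The first line adds a column of lagged (+1) observations. The second line corrects the first entry of each group, because its lagged observation comes from the previous group. This relies on the rows of each group being contiguous and already in time order.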
Note that data here is a data.frame, not a data.table.
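If the rows of a group are not contiguous, base R's ave is one possible alternative, since it applies a function within each group and returns the result in the original row order. A minimal sketch, assuming the rows are still in time order within each group (df is a made-up name):

```r
set.seed(1)
df <- data.frame(time = c(1:3, 1:4),
                 groups = rep(c("b", "a"), c(3, 4)),
                 value = rnorm(7))

# ave() splits value by groups, applies FUN to each piece and puts the
# results back in the original row positions.
df$lag.value <- ave(df$value, df$groups,
                    FUN = function(x) c(NA, head(x, -1)))
df
```

This also sidesteps the reordering seen with unlist(tapply(...)) in the question, because the output vector lines up with the rows of df.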