Calculating a rolling sum by id variables, with missing time points

ADJ*_*ADJ 14 r sas plyr zoo

I'm trying to learn R after working in SAS for more than 10 years, but I can't figure out the best R way to do this. Take this data:

 id  class           t  count  desired
 --  -----  ----------  -----  -------
  1      A  2010-01-15      1        1
  1      A  2010-02-15      2        3
  1      B  2010-04-15      3        3
  1      B  2010-09-15      4        4
  2      A  2010-01-15      5        5
  2      B  2010-06-15      6        6
  2      B  2010-08-15      7       13
  2      B  2010-09-15      8       21

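I want to calculate the desired column as a rolling sum by id and class over a 4-month rolling window. Note that not every month is present for each combination of id and class.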

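In SAS, I'd typically do this in one of two ways:

  1. RETAIN plus a BY on id and class.
  2. PROC SQL with a left join from df as df1 to df as df2 on id, class, and df1.d-df2.d falling within the appropriate window.

What is the best R approach to this type of problem?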

t <- as.Date(c("2010-01-15","2010-02-15","2010-04-15","2010-09-15",
               "2010-01-15","2010-06-15","2010-08-15","2010-09-15"))
class <- c("A","A","B","B","A","B","B","B")
id <- c(1,1,1,1,2,2,2,2)
count <- seq(1,8,length.out=8)
desired <- c(1,3,3,4,5,6,13,21)
df <- data.frame(id,class,t,count,desired)

G. *_*eck 18

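Here are a few solutions:

1) zoo  Using ave, create for each group a monthly series, m, by merging the original series, z, with a grid, g. Then calculate the rolling sum and retain only the original time points: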

library(zoo)
f <- function(i) { 
    z <- with(df[i, ], zoo(count, t))
    g <- zoo(, seq(start(z), end(z), by = "month"))
    m <- merge(z, g)
    window(rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE), time(z))
}
df$desired <- ave(1:nrow(df), df$id, df$class, FUN = f)

This gives:

> df
  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21
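
To make the steps inside f easier to follow, here is a minimal sketch that runs them by hand on a single group (the id 2 / class B rows from the question). The variable name res is just illustrative and is not part of the solution above; the sketch assumes, as in the question, that all dates fall on the same day of the month so that the monthly grid lines up with them.

```r
library(zoo)

# One group only: id 2, class B
z <- zoo(c(6, 7, 8), as.Date(c("2010-06-15", "2010-08-15", "2010-09-15")))

# Zero-width monthly grid spanning the group's date range
g <- zoo(, seq(start(z), end(z), by = "month"))

# Merging with the grid adds the missing month (July) as NA
m <- merge(z, g)

# Right-aligned 4-month sum; NAs are ignored, and partial = TRUE
# lets the first few months use a shorter window
r <- rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE)

# Keep only the original time points
res <- window(r, time(z))
coredata(res)
```

The July entry exists only in m and r; window() drops it again, which is why the result has one value per original row.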

Note that we assume the rows are ordered by time within each group (as in the question). If that is not the case, sort df first.
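
If the data is not already ordered, one possible way to sort it in base R is sketched below. The small unsorted data frame and the name df_sorted are made up for illustration only.

```r
# Illustrative unsorted data (not the data from the question)
df <- data.frame(id    = c(2, 1, 1),
                 class = c("B", "A", "A"),
                 t     = as.Date(c("2010-06-15", "2010-02-15", "2010-01-15")),
                 count = c(6, 2, 1))

# Order by id, then class, then time within each group
df_sorted <- df[order(df$id, df$class, df$t), ]
rownames(df_sorted) <- NULL  # reset row names after reordering
df_sorted
```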

2)sqldf

library(sqldf)
sqldf("select id, class, a.t, a.'count', sum(b.'count') desired 
   from df a join df b 
   using(id, class) 
   where a.t - b.t between 0 and 100
   group by id, class, a.t")

This gives:

  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

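Note: If the merge would be too large to fit into memory, use sqldf("...", dbname = tempfile()) so that the intermediate results are stored in a database that is created on the fly and automatically destroyed afterwards.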
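
The query expresses the 4-month window as a difference in days (between 0 and 100). The sketch below is only a sanity check on that threshold, under the assumption that every date falls on the 15th as in the question: it compares how many days apart dates 3 months apart and 4 months apart are, for each starting month in 2010. The names starts, plus3 and plus4 are illustrative.

```r
# The 15th of each month in 2010, and the same dates shifted by 3 and 4 months
starts <- seq(as.Date("2010-01-15"), by = "month", length.out = 12)
plus3  <- seq(as.Date("2010-04-15"), by = "month", length.out = 12)
plus4  <- seq(as.Date("2010-05-15"), by = "month", length.out = 12)

max3 <- max(as.numeric(plus3 - starts))  # largest gap for dates 3 months apart
min4 <- min(as.numeric(plus4 - starts))  # smallest gap for dates 4 months apart
c(max3, min4)
```

For these dates the largest 3-month gap is 92 days and the smallest 4-month gap is 120 days, so a cutoff of 100 keeps the current month plus the three before it. If the dates fell on varying days of the month, a day-based cutoff would no longer map cleanly onto calendar months.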

3) Base R  The sqldf solution inspired this base R solution, which simply translates the SQL into R:

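m <- merge(df, df, by = 1:2)  # self-join on id and class
s <- subset(m, t.x - t.y >= 0 & t.x - t.y <= 100)  # keep pairs inside the window
# count.x is included as a grouping variable so that ag has the five columns named below
ag <- aggregate(count.y ~ t.x + class + id + count.x, s, sum)
names(ag) <- c("t", "class", "id", "count", "desired")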

The result is:

> ag
           t class id count desired
1 2010-01-15     A  1     1       1
2 2010-02-15     A  1     2       3
3 2010-04-15     B  1     3       3
4 2010-09-15     B  1     4       4
5 2010-01-15     A  2     5       5
6 2010-06-15     B  2     6       6
7 2010-08-15     B  2     7      13
8 2010-09-15     B  2     8      21

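Note: This does the merge in memory, which could be a problem if the data set is very large.

UPDATE: Simplified the first solution and added a second solution.

UPDATE 2: Added a third solution.

  • Thanks also for your work on the `zoo` package, it's much appreciated! (3 upvotes)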
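
If the self-merge is the concern, one possible alternative (a sketch, not one of the three solutions above) is to loop over the groups and compare each row only against the rows of its own group, so the merged table is never built. The helper window_sum is hypothetical, and the 100-day cutoff simply mirrors the SQL condition.

```r
dates <- as.Date(c("2010-01-15", "2010-02-15", "2010-04-15", "2010-09-15",
                   "2010-01-15", "2010-06-15", "2010-08-15", "2010-09-15"))
df <- data.frame(id    = c(1, 1, 1, 1, 2, 2, 2, 2),
                 class = c("A", "A", "B", "B", "A", "B", "B", "B"),
                 t     = dates,
                 count = 1:8)

# Hypothetical helper: for each row of one group, sum the counts of rows
# in that group dated 0 to `days` days earlier (inclusive)
window_sum <- function(d, days = 100) {
  sapply(seq_len(nrow(d)), function(i) {
    lag <- as.numeric(d$t[i] - d$t)
    sum(d$count[lag >= 0 & lag <= days])
  })
}

# Row indices for each id/class combination that actually occurs
groups <- split(seq_len(nrow(df)), list(df$id, df$class), drop = TRUE)

df$desired <- NA_real_
for (idx in groups) df$desired[idx] <- window_sum(df[idx, ])
df$desired
```

This still keeps df itself in memory and does quadratic work within each group, so it only helps when the groups are small relative to the whole data set.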

Aar*_*ica 5

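I'm almost embarrassed to post this. I'm usually pretty good at these, but there's got to be a better way.

This first uses zoo's as.yearmon to reduce the dates to just month and year, then reshapes the data to get one column for each id/class combination, then fills in zeros before, after, and for the missing months, then uses zoo to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.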

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21