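I want to add a new column to a data.table that contains the cumulative product of Data1 ordered by Date. The cumulative product should be computed separately for each category (Cat) and should start from the most recent available Date.
Sample data: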
library(data.table)
DF = data.frame(Cat=rep(c("A","B"),each=4), Date=rep(c("01-08-2013","01-07-2013","01-04-2013","01-03-2013"),2), Data1=c(1:8))
DF$Date = as.Date(DF$Date , "%m-%d-%Y")
DT = data.table(DF)
DT[ , Data1_cum:=NA_real_]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 NA
2: A 2013-01-07 2 NA
3: A 2013-01-04 3 NA
4: A 2013-01-03 4 NA
5: B 2013-01-08 5 NA
6: B 2013-01-07 6 NA
7: B 2013-01-04 7 NA
8: B 2013-01-03 8 NA
The result should look like this:
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
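I found that I can do something like this with cumprod(), but I don't know how to handle the categories. NAs in Data1 should be ignored, i.e. treated as 1. The real data set has about 8 million rows and 1000 categories.
If the only apparent issue is the ordering...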
DT[order(Date, decreasing=TRUE), Data1_cum := cumprod(Data1), by=Cat]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
Note: if you shuffle the order of the rows, the result may differ. Be careful how you write the order(.) call.
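A minimal sketch of what ordering in `i` does: `order()` only controls the sequence in which `cumprod()` sees the rows, while `:=` writes each result back to the row it came from. The table `toy` and its values are made up for illustration and are not part of the data above.

```r
library(data.table)

# Hypothetical toy table, deliberately stored oldest-first
toy <- data.table(
  Cat   = c("A", "A", "A"),
  Date  = as.Date(c("2013-01-03", "2013-01-04", "2013-01-07")),
  Data1 = c(4, 3, 2)
)

# order() in i feeds the rows to j newest-first;
# := assigns the results back to the original row positions
toy[order(Date, decreasing = TRUE), Data1_cum := cumprod(Data1), by = Cat]
print(toy)
```

The physical row order of `toy` is unchanged: the newest row (2013-01-07) gets 2, the next gets 6, and the oldest gets 24.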
## Let's add some NA values
DT <- rbind(DT, DT)
DT[c(2, 6, 11, 15), Data1 := NA]
# shuffle the rows, to make sure this is right
set.seed(1)
DT <- DT[sample(nrow(DT))]
Assigning the cumulative product:
## If you want to leave the NA's as NA's in the cum prod, use:
DT[ , Data1_cum := NA_real_ ]
DT[ intersect(order(Date, decreasing=TRUE), which(!is.na(Data1)))
, Data1_cum := cumprod(Data1)
, by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA NA <~~~~~~~ Note that the NA rows stay NA; the product continues on the next non-NA row
5: A 2013-01-04 3 6
6: A 2013-01-04 NA NA <~~~~~~~ Note that the NA rows stay NA; the product continues on the next non-NA row
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA NA <~~~~~~~ Note that the NA rows stay NA; the product continues on the next non-NA row
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA NA <~~~~~~~ Note that the NA rows stay NA; the product continues on the next non-NA row
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
## If instead you want to treat the NA's as 1, use:
# order() in i already feeds the rows to j newest-first, so no second order() is needed inside j
DT[order(Date, decreasing=TRUE), Data1_cum := cumprod(replace(Data1, is.na(Data1), 1)), by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA 2 <~~~~~~~ Rows with NA took on values of the previous Row
5: A 2013-01-04 3 6
6: A 2013-01-04 NA 6 <~~~~~~~ Rows with NA took on values of the previous Row
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA 150 <~~~~~~~ Rows with NA took on values of the previous Row
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA 1050 <~~~~~~~ Rows with NA took on values of the previous Row
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
Or, if you already have the cumulative product and only want to get rid of the NAs, you can do it as follows:
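To see why the NA handling matters, here is a small base R sketch. The vector `x` is hypothetical and not taken from the data above. `cumprod()` propagates an NA to every later element, so replacing NA with 1 (the multiplicative identity) beforehand is what lets the running product carry on.

```r
# Hypothetical vector, for illustration only
x <- c(2, NA, 3, 4)

# cumprod() propagates NA to all subsequent elements
plain <- cumprod(x)

# Replacing NA with 1 first makes the NA a no-op in the product
na_as_one <- cumprod(replace(x, is.na(x), 1))

print(plain)
print(na_as_one)
```

Here `plain` is NA from the second element onward, while `na_as_one` is 2, 2, 6, 24.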
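# fix the NA's with the previous value
# (handles isolated NA's only: of two consecutive NA's the second stays NA,
#  and an NA in the first row of a group becomes 0)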
DT[order(Date, decreasing=TRUE),
Data1_cum := {tmp <- c(0, head(Data1_cum, -1));
Data1_cum[is.na(Data1_cum)] <- tmp[is.na(Data1_cum)];
Data1_cum }
, by=Cat ]
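As a possible alternative for filling the gaps, `nafill()` with `type = "locf"` carries the last non-NA value forward and also copes with runs of consecutive NAs. This is only a sketch: it assumes a data.table version that ships `nafill()` (1.12.4 or later), and the table `toy` is made up for illustration.

```r
library(data.table)

# Hypothetical toy table: one category, already in newest-first order
toy <- data.table(Cat = "A", Data1_cum = c(1, NA, NA, 6, NA))

# Carry the last non-NA cumulative product forward within each category
toy[, Data1_cum := nafill(Data1_cum, type = "locf"), by = Cat]
print(toy)
```

The filled column is 1, 1, 1, 6, 6. Unlike the lag-based fix, an NA in the first row of a group would stay NA here, because there is no earlier value to carry forward. On the real table the rows would also have to be in newest-first order within each category before filling.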