sta*_*ant 4 r reshape data.table
Yet another reshape problem with data.table
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
# x y v
# 1: 1 A 12
# 2: 1 B 62
...
#11: 3 A 63
#12: 3 B 49
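I would like to compute the cumulative sums of x and v by y, but with the result laid out so that the number of rows stays the same: SUM.*.A is incremented on rows where y == A, and likewise SUM.*.B on rows where y == B. (As usual, y may have many levels; there are 2 in this example.)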
# SUM.x.A SUM.x.B SUM.v.A SUM.v.B
# 1: 1 NA 12 NA
# 2: 1 1 12 62
...
#11: 12 9 318 289
#12: 12 12 318 338
EDIT: here is my poor solution, which is obviously overcomplicated:
# First step: create the per-group cumulative-sum columns
colNames <- c("x", "v"); newColNames <- paste0("SUM.", colNames)
DT[, (newColNames) := lapply(.SD, cumsum), by = y, .SDcols = colNames]
# Now reshape each SUM.* column into SUM.*.{yvalue}
DT[, N := .I]; setattr(DT, "sorted", "N")  # N is the row index, marked as the key
g <- function(DT, SD) {
  cols <- c("N", grep("SUM", colnames(SD), value = TRUE))
  Yval <- unique(SD[, y])
  merge(DT, SD[, cols, with = FALSE], suffixes = c("", paste0(".", Yval)), all.x = TRUE)
}
DT <- Reduce(f = g, init = DT, x = split(DT, DT$y))
# Last observation carried forward: fill each NA with the previous non-NA value
locf <- function(x) {
  ind <- which(!is.na(x))
  if (is.na(x[1])) ind <- c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}
newColNames <- grep("SUM", colnames(DT), value = TRUE)
DT <- DT[, (newColNames) := lapply(.SD, locf), .SDcols = newColNames]
Try this:
cumsum0 <- function(x) { x <- cumsum(x); ifelse(x == 0, NA, x) }
DT2 <- DT[, {SUM.<-y; lapply(data.table(model.matrix(~ SUM.:x + SUM.:v + 0)), cumsum0)}]
setnames(DT2, sub("(.):(.)", "\\2.\\1", names(DT2)))
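To see why this works: each column that model.matrix builds from a term such as SUM.:x is x multiplied by a 0/1 indicator for one level of y, so its running total only grows on rows of that level. Below is a minimal sketch of the same idea written out by hand, without model.matrix. The toy table and the names cumsum_na, toy and res are made up for illustration and are not part of the answer above.

```r
library(data.table)

# Small hand-written table (illustrative values, not the question's random sample)
toy <- data.table(x = c(1, 1, 2, 2),
                  y = c("A", "B", "A", "B"),
                  v = c(10, 20, 30, 40))

# Hypothetical helper mirroring cumsum0: running total, with 0 shown as NA
cumsum_na <- function(x) { s <- cumsum(x); ifelse(s == 0, NA, s) }

out <- list()
for (col in c("x", "v")) {
  for (lev in unique(toy$y)) {
    # Zero out the rows that belong to other levels, then accumulate
    masked <- toy[[col]] * (toy$y == lev)
    out[[paste("SUM", col, lev, sep = ".")]] <- cumsum_na(masked)
  }
}
res <- as.data.table(out)
print(res)
```

Like cumsum0, this sketch turns any running total that is exactly 0 into NA, so it is only appropriate when a genuine total of 0 cannot occur.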
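Simplifications:
1) If using 0 in place of NA is acceptable, the first line, which defines cumsum0, can be omitted, and cumsum0 in the next line replaced with cumsum.
2) The result of the second line has the following names: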
> names(DT2)
[1] "SUM.A:x" "SUM.B:x" "SUM.A:v" "SUM.B:v"
So if those names are good enough, the last line can be dropped, since its only purpose is to make the names exactly match those in the question.
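With both simplifications applied, the whole thing reduces to a single statement. The sketch below runs it on a small made-up table (toy and DT_simple are illustrative names, and the values are not the question's sample); zeros appear where the full version shows NA, and the columns keep the names that model.matrix generates.

```r
library(data.table)

# Small hand-written table (illustrative values only)
toy <- data.table(x = c(1, 1, 2, 2),
                  y = c("A", "B", "A", "B"),
                  v = c(10, 20, 30, 40))

# Plain cumsum instead of cumsum0, and no setnames() call afterwards
DT_simple <- toy[, {SUM. <- y; lapply(data.table(model.matrix(~ SUM.:x + SUM.:v + 0)), cumsum)}]
print(names(DT_simple))
print(DT_simple)
```

Because these names contain a colon, the columns have to be accessed with backticks or with [[ ]], for example DT_simple[["SUM.A:x"]].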
The result (without the simplifications) is:
> DT2
SUM.x.A SUM.x.B SUM.v.A SUM.v.B
1: 1 NA 12 NA
2: 1 1 12 62
3: 2 1 72 62
4: 2 2 72 123
5: 4 2 155 123
6: 4 4 155 220
7: 6 4 156 220
8: 6 6 156 242
9: 9 6 255 242
10: 9 9 255 289
11: 12 9 318 289
12: 12 12 318 338