gre*_*Bag 5 optimization r data.table
DT <- data.table(Id = c(1, 1, 1, 1, 10, 100, 100, 101, 101, 101),
Date = as.Date(c("1997-01-01", "1997-01-02", "1997-01-03", "1997-01-04",
"1997-01-02", "1997-01-02", "1997-01-04", "1997-01-03",
"1997-01-04", "1997-01-04")),
Price = c(29, 25, 14, 26, 30, 16, 13, 62, 12, 6),
IsFirst = c(T,F,F,F,T,T,F,T,F,F))
Run Code Online (Sandbox Code Playgroud)
Id:客户ID; 日期:交易日期; 价格:交易金额; isFirst:TRUE是交易是客户第一.每个客户有1个IsFirst == TRUE,[0,Inf] IsFirst == FALSE.
>DT
Id Date Price IsFirst
1: 1 1997-01-01 29 TRUE
2: 1 1997-01-02 25 FALSE
3: 1 1997-01-03 14 FALSE
4: 1 1997-01-04 26 FALSE
5: 10 1997-01-02 30 TRUE
6: 100 1997-01-02 16 TRUE
7: 100 1997-01-04 13 FALSE
8: 101 1997-01-03 62 TRUE
9: 101 1997-01-04 12 FALSE
10: 101 1997-01-04 6 FALSE
Run Code Online (Sandbox Code Playgroud)
我需要把它投入
Id 1997-01-01 1997-01-02 1997-01-03 1997-01-04
1: 1 29 25 14 25
2: 10 NA 30 0 0
3: 100 NA 16 0 13
4: 101 NA NA 62 18
Run Code Online (Sandbox Code Playgroud)
NA值应仅在客户进行第一笔交易之前出现.在第一次交易之后,缺失的值应该用0填充.我试过:
dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price", fill = NA)
Run Code Online (Sandbox Code Playgroud)
但是这不起作用,因为它用NA填充所有缺少的字段.目前我正在使用循环遍历所有Ids手动设置客户IsFirst到NA之前的字段:
DT2 <- dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price")
Ids <- unique(DT$Id)
for(id in Ids){
if(DT[(Id == id & IsFirst == T),]$Date > as.Date(names(DT2)[2])){
DT2[Id == id, 2:(which(names(DT2)==as.character(DT[(Id == id & IsFirst == T),]$Date))-1) := NA, with = F]
}
}
Run Code Online (Sandbox Code Playgroud)
当我的数据变大时,这非常慢.什么是最快,最有效的方法?
铸造-熔铸-铸造
dDT <- dcast(DT, Id~Date, sum, value.var="Price")
setDT(dDT) # if not using data.table 1.9.5+
mDT <- melt(dDT,id.vars = c("Id"), variable.name="Date", value.name="Price")
mDT[, `:=`(idi = 1:.N, first_sale = which.max(!!Price)), by=Id]
mDT[ idi < first_sale, Price := NA_real_ ]
res <- dcast(mDT, Id~Date, sum, value.var="Price")
Run Code Online (Sandbox Code Playgroud)
合并投射
setkey(DT,Id,Date)
mergeDT <- DT[, .(Price=sum(Price)), by=key(DT)][CJ(unique(Id),unique(Date))]
mergeDT[, ok := cumsum(!is.na(Price)) > 0, by=Id]
mergeDT[ok & is.na(Price), Price := 0]
res2 <- dcast(mergeDT, Id~Date, value.var="Price")
Run Code Online (Sandbox Code Playgroud)
我发现这种方式更直观,在投射之前以长形式完成所有事情。