I have a data frame of the following form:
df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
                 product = c("A", "B", "A", "D", "A"),
                 purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01"))
df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")
It looks like this:
   client product purchase_Date
1 client1       A    2010-03-22
2 client1       B    2010-02-02
3 client2       A    2009-03-02
4 client3       D    2011-04-05
5 client3       A    2012-11-01
I would like to rearrange it like this:
   client purchase1 purchase2
1 client1         B         A
2 client2         A      <NA>
3 client3         D         A
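So I want to know which product each client bought first, second, third, and so on, ordered by purchase date. I can easily get each of these separately using data.table: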
library(data.table)
setDT(df)[, .SD[order(purchase_Date), product][1], by = client]
That gives the first purchase, but I don't know how to get the whole desired output efficiently.
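Here is a possible data.table solution. (If any client has ten or more purchases, I'd suggest dropping paste0 and using just indx := seq_len(.N) instead: the pasted names are sorted as text when the columns are built, so purchase10 would land before purchase2 and mess up the purchase order.)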
setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
dcast(df, client ~ indx, value.var = "product")
#     client purchase1 purchase2
# 1: client1         B         A
# 2: client2         A        NA
# 3: client3         D         A
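The caveat about ten or more purchases can be sketched as follows. This is only an illustration under assumptions: the 11-row table and the post-hoc renaming step are made up for this example and are not part of the original answer. The idea is to keep indx as an integer so that dcast orders the columns numerically, and to add the "purchase" prefix only afterwards.

```r
library(data.table)

# Hypothetical data: one client with 11 purchases on consecutive days.
dt <- data.table(
  client = "client1",
  product = paste0("p", 1:11),
  purchase_Date = as.Date("2010-01-01") + 0:10
)

# Integer index per client, assigned in purchase-date order.
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# With an integer indx the wide columns come out as 1, 2, ..., 11.
wide <- dcast(dt, client ~ indx, value.var = "product")

# Add the prefix after reshaping, so the column order is not affected.
old <- names(wide)[-1]
setnames(wide, old, paste0("purchase", old))
names(wide)
```

Renaming after the reshape is just one way to keep readable column names; leaving the columns named 1, 2, 3, ... works as well.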
A comparison of the frank() and order() methods for creating the indx column:
require(data.table)
set.seed(45L);
dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE),
           purchase_Date = as.Date(sample(14610:16586, .N, FALSE),
                                   origin = "1970-01-01")), by=client]
system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
#    user  system elapsed
#    0.19    0.02    0.20
system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
#    user  system elapsed
#    3.94    0.00    3.98
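Apart from speed, the two approaches can also differ in their results when a client has several purchases on the same date. The sketch below uses a small made-up table (not the benchmark data) to show this: seq_len(.N) after ordering gives every row its own index, while a dense rank gives tied dates the same value.

```r
library(data.table)

# Hypothetical data: client1 buys two products on the same day.
dt <- data.table(
  client = c("client1", "client1", "client1"),
  product = c("A", "B", "C"),
  purchase_Date = as.Date(c("2010-02-02", "2010-02-02", "2010-03-22"))
)

# order() + seq_len(): a distinct position for every purchase.
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# frank() with dense ties: purchases on the same date share a rank.
dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by = client]

dt[, .(product, purchase_Date, indx, purch_rank)]
```

If the rank column were then used in dcast, the two tied purchases would fall into the same cell, so the order() based index is probably the safer choice whenever one column per purchase is wanted.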