Consider these tables, purchases and sales:
library(data.table)

purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,2,3,3,3)
)
sales <- data.table(
  sale_id = c(1,2,3,4,5,6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,1,3,3,4)
)
> purchases
   purchase_id store date
1:          10     a    1
2:          20     a    1
3:          30     a    2
4:          40     b    3
5:          50     b    3
6:          60     b    3
> sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    3
5:       5     b    3
6:       6     b    4
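I want to map each purchase to a sale that occurred at the same time or later (and at the same store). The catch is that each purchase should map to exactly one sale or to none, and vice versa.
Several solutions would satisfy this requirement, but a simple one follows this algorithm: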
For each purchase:
  Subset sales where sale store matches purchase store and sale date >= purchase date
  Select the first sale in the subset and map it to this purchase
  REMOVE THIS SALE FROM THE sales TABLE!
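
The steps above can be sketched as a plain loop. This is only a minimal, unoptimized sketch: greedy_match is a hypothetical helper name (it is not part of data.table), and it assumes that "first sale" means first in the row order of sales and that purchases are processed in row order.

```r
library(data.table)

purchases <- data.table(
  purchase_id = c(10, 20, 30, 40, 50, 60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 2, 3, 3, 3)
)
sales <- data.table(
  sale_id = c(1, 2, 3, 4, 5, 6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 1, 3, 3, 4)
)

# Hypothetical helper: greedy one-to-one matching in row order.
greedy_match <- function(purchases, sales) {
  available <- rep(TRUE, nrow(sales))        # sales not yet used
  matched <- rep(NA_real_, nrow(purchases))  # sale_id per purchase, NA if none
  for (p in seq_len(nrow(purchases))) {
    # Sales still available, in the same store, on or after the purchase date
    candidates <- which(
      available &
        sales$store == purchases$store[p] &
        sales$date >= purchases$date[p]
    )
    if (length(candidates) > 0L) {
      first <- candidates[1L]
      matched[p] <- sales$sale_id[first]
      available[first] <- FALSE              # "remove" this sale
    }
  }
  data.table(purchase_id = purchases$purchase_id, sale_id = matched)
}

mapping <- greedy_match(purchases, sales)
mapping
```

Instead of deleting rows from sales, the sketch tracks used sales in a logical vector, which avoids modifying the input table. For the sample tables, the sale_id column of mapping is 1, 2, NA, 4, 5, 6.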
This yields a mapping like this:
   purchase_id sale_id
1:          10       1
2:          20       2
3:          30      NA
4:          40       4
5:          50       5
6:          60       6
Is there an elegant way to do this with data.table?
Here is a dirty but working solution I came up with.
rolling_join_without_replacement <- function(x, i, on, roll, allow.cartesian = FALSE){
  # Dirty implementation of a rolling join matching algo without replacement
  # Each row in i maps to exactly one row in the result
  # Each row in x maps to exactly zero or one rows in the result

  # Copy x and i
  x2 <- copy(x)
  i2 <- copy(i)

  # Create row id fields for each table
  x2[, x_row := .I]
  i2[, i_row := .I]

  allmatches <- list()
  while(TRUE){
    # Execute the rolling join
    matches <- x2[i2, on = on, roll = roll, allow.cartesian = allow.cartesian, nomatch = 0L]

    # If no matches, break
    if(nrow(matches) == 0) break

    # Get the first match per i, then get the first match per x
    matches <- matches[matches[, .I[1L], by = i_row]$V1]
    matches <- matches[matches[, .I[1L], by = x_row]$V1]

    # Save these matches
    allmatches <- c(allmatches, list(matches))

    # Exclude these x and i from future matches
    x2 <- x2[!matches, on = "x_row"]
    i2 <- i2[!matches, on = "i_row"]
  }

  # Combine matches
  allmatches <- rbindlist(allmatches, use.names = TRUE)

  # Include unmatched i rows
  unmatched <- i2[!allmatches, on = "i_row"]
  allmatches <- rbind(allmatches, unmatched, use.names = TRUE, fill = TRUE)

  return(allmatches[])
}
Usage
rolling_join_without_replacement(
  x = sales,
  i = purchases,
  on = c("store", "date"),
  roll = -Inf,
  allow.cartesian = TRUE
)
   purchase_id sale_id
1:          10       1
2:          20       2
3:          30      NA
4:          40       4
5:          50       5
6:          60       6
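According to the OP, the goal is to
map each purchase to a sale that occurred at the same time or later (and at the same store). The catch is that each purchase should map to exactly one sale or to none, and vice versa.
If I understand correctly, the OP wants to align the vector of purchase IDs with the vector of sale IDs after removing the sale events that happened before the purchase (for each store).
Here is an approach that uses a non-equi join and rowid() to pick the aligned rows: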
library(data.table)
sales[purchases, on = c("store", "date>=date"),
      .(store, purchase_id, sale_id = sale_id[x.date >= i.date])][
        rowid(store, purchase_id) == rowid(store, sale_id)]
Result for the modified use case (extended to cover more edge cases, e.g., more stores):
   store purchase_id sale_id
1:     a          10       1
2:     a          20       2
3:     a          30      NA
4:     b          40       5
5:     b          50       6
6:     b          60       7
7:     d          70      NA
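Note that store is included for safety and completeness, since purchase_id and sale_id might not be unique across all stores.
Also note that the result depends heavily on the row order in purchases and sales.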
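
Because the row order matters, it may help to make that order explicit before matching. The following is a small sketch using data.table's setorder(); sorting by store, then date, then id is an assumption about the intended order (the original post does not prescribe a tie-breaker), and the shuffled table is made up for illustration.

```r
library(data.table)

# Illustrative purchases table whose rows arrive out of order
shuffled <- data.table(
  purchase_id = c(30, 10, 20),
  store = c("a", "a", "a"),
  date = c(2, 1, 1)
)

# Sort in place: by store, then date, then id as an assumed tie-breaker
setorder(shuffled, store, date, purchase_id)

shuffled$purchase_id
```

setorder() reorders the table by reference, so no copy is made. The same call with sale_id in place of purchase_id could be applied to sales.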
Modified sample data that covers more edge cases:
purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60,70),
  store = c("a", "a", "a", "b", "b", "b", "d"),
  date = c(1,1,2,3,3,3,3)
)
sales <- data.table(
  sale_id = c(1,2,3,4,5,6,7,8),
  store = c("a", "a", "a", "b", "b", "b", "b", "c"),
  date = c(1,1,1,2,3,3,4,5)
)
purchases
   purchase_id store date
1:          10     a    1
2:          20     a    1
3:          30     a    2
4:          40     b    3
5:          50     b    3
6:          60     b    3
7:          70     d    3
This includes an additional purchase in store d.
sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    2
5:       5     b    3
6:       6     b    3
7:       7     b    4
8:       8     c    5
This includes 2 additional sales (rows 4 and 8) and an additional store c.
The first expression
sales[purchases, on = c("store", "date>=date"),
      .(store, purchase_id, sale_id = sale_id[x.date >= i.date])]
returns all possible combinations of purchase_id with valid sale_ids, i.e., only those sale_ids whose sale date x.date is on or after the purchase date i.date (for each store):
    store purchase_id sale_id
 1:     a          10       1
 2:     a          10       2
 3:     a          10       3
 4:     a          20       1
 5:     a          20       2
 6:     a          20       3
 7:     a          30      NA
 8:     b          40       5
 9:     b          40       6
10:     b          40       7
11:     b          50       5
12:     b          50       6
13:     b          50       7
14:     b          60       5
15:     b          60       6
16:     b          60       7
17:     d          70      NA
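
To see what the x. and i. prefixes refer to in this non-equi join, both dates can be returned side by side. This is only an illustrative sketch on the modified sample data; the column names purchase_date and sale_date and the variable candidates are made up for the example.

```r
library(data.table)

purchases <- data.table(
  purchase_id = c(10, 20, 30, 40, 50, 60, 70),
  store = c("a", "a", "a", "b", "b", "b", "d"),
  date = c(1, 1, 2, 3, 3, 3, 3)
)
sales <- data.table(
  sale_id = c(1, 2, 3, 4, 5, 6, 7, 8),
  store = c("a", "a", "a", "b", "b", "b", "b", "c"),
  date = c(1, 1, 1, 2, 3, 3, 4, 5)
)

# x.date is the date column of sales (the x table),
# i.date is the date column of purchases (the i table)
candidates <- sales[purchases, on = c("store", "date>=date"),
                    .(store, purchase_id, purchase_date = i.date,
                      sale_id, sale_date = x.date)]
candidates
```

Purchases without any valid sale (here purchase_id 30 and 70) still appear once, with NA in sale_id and sale_date, because the join keeps every row of purchases by default.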
The second expression
[rowid(store, purchase_id) == rowid(store, sale_id)]
creates id numbers for each unique value of purchase_id, and likewise for each unique value of sale_id, and then subsets to the rows where the two id numbers match.
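
A small standalone illustration of how rowid() numbers repeated values and how comparing the two numberings picks one row per pair. The vectors are a cut-down, made-up version of the store a rows, with two purchases and two candidate sales each.

```r
library(data.table)

# Every purchase paired with every candidate sale (store "a" only)
store       <- c("a", "a", "a", "a")
purchase_id <- c(10, 10, 20, 20)
sale_id     <- c(1, 2, 1, 2)

# Running count of each (store, purchase_id) combination: 1 2 1 2
p_rank <- rowid(store, purchase_id)
# Running count of each (store, sale_id) combination: 1 1 2 2
s_rank <- rowid(store, sale_id)

# Keep rows where both counts agree: (10, 1) and (20, 2)
keep <- p_rank == s_rank
data.table(store, purchase_id, sale_id)[keep]
```

The k-th candidate row of a purchase is kept only if it is also the k-th time that sale appears, which is what pairs the first purchase with the first sale, the second with the second, and so on.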