How to implement a rolling join "without replacement" (each row in the source table should map to 0 or 1 rows in the result)

Ben*_*Ben 5 r data.table

Consider the tables purchases and sales:

purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,2,3,3,3)
)

sales <- data.table(
  sale_id = c(1,2,3,4,5,6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,1,3,3,4)
)

> purchases
    purchase_id store date
1:           10     a    1
2:           20     a    1
3:           30     a    2
4:           40     b    3
5:           50     b    3
6:           60     b    3
> sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    3
5:       5     b    3
6:       6     b    4

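I want to map each purchase to a sale that happened at the same time or later (and in the same store). The catch is that each purchase should map to exactly one sale or to none, and vice versa.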

Several solutions would satisfy this requirement, but a simple one follows this algorithm:

For each purchase:
  Subset sales where sale store matches purchase store and sale date >= purchase date
  Select the first sale in the subset and map it to this purchase
  REMOVE THIS SALE FROM THE sales TABLE!

This produces a mapping like this:

    purchase_id sale_id
1:           10       1
2:           20       2
3:           30      NA
4:           40       4
5:           50       5
6:           60       6
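
The procedure described above can be written down literally as a loop. The following is only a minimal reference sketch for checking other implementations against: greedy_match is a hypothetical helper name, not a data.table function, and "first sale" is assumed to mean the first remaining sale in the row order of sales.

```r
library(data.table)

purchases <- data.table(
  purchase_id = c(10, 20, 30, 40, 50, 60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 2, 3, 3, 3)
)
sales <- data.table(
  sale_id = c(1, 2, 3, 4, 5, 6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 1, 3, 3, 4)
)

# Hypothetical helper: literal greedy matching, one purchase at a time.
greedy_match <- function(purchases, sales) {
  available <- rep(TRUE, nrow(sales))        # sales that have not been used yet
  matched <- rep(NA_real_, nrow(purchases))  # sale_id per purchase, NA if none
  for (p in seq_len(nrow(purchases))) {
    # Remaining sales in the same store, dated on or after the purchase
    candidates <- which(
      available &
        sales$store == purchases$store[p] &
        sales$date >= purchases$date[p]
    )
    if (length(candidates) > 0L) {
      first <- candidates[1L]                # assumption: first in row order
      matched[p] <- sales$sale_id[first]
      available[first] <- FALSE              # "remove" the sale from the pool
    }
  }
  data.table(purchase_id = purchases$purchase_id, sale_id = matched)
}

mapping <- greedy_match(purchases, sales)
print(mapping)
```

The loop visits every purchase and scans all sales each time, so it is slow on large tables; it is meant as a specification of the desired result rather than as a practical solution.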

Is there an elegant way to do this with data.table?


Dirty solution

Here is a dirty but working solution that I developed.

rolling_join_without_replacement <- function(x, i, on, roll, allow.cartesian = FALSE){
  # Dirty implementation of a rolling join matching algo without replacement
  # Each row in i maps to exactly one row in the result
  # Each row in x maps to exactly zero or one rows in the result
  
  # Copy x and i
  x2 <- copy(x)
  i2 <- copy(i)
  
  # Create row id fields for each table
  x2[, x_row := .I]
  i2[, i_row := .I]
  
  allmatches <- list()
  while(TRUE){
    
    # Execute the rolling join
    matches <- x2[i2, on = on, roll = roll, allow.cartesian = allow.cartesian, nomatch = 0L]
    
    # If no matches, break
    if(nrow(matches) == 0) break
    
    # Get the first match per i, then get the first match per x
    matches <- matches[matches[, .I[1L], by = i_row]$V1]
    matches <- matches[matches[, .I[1L], by = x_row]$V1]
    
    # Save these matches
    allmatches <- c(allmatches, list(matches))
    
    # Exclude these x and i from future matches
    x2 <- x2[!matches, on = "x_row"]
    i2 <- i2[!matches, on = "i_row"]
  }
  
  # Combine matches
  allmatches <- rbindlist(allmatches, use.names = TRUE)
  
  # Include unmatched i rows
  unmatched <- i2[!allmatches, on = "i_row"]
  allmatches <- rbind(allmatches, unmatched, use.names = TRUE, fill = TRUE)
  
  return(allmatches[])
}

Usage

rolling_join_without_replacement(
  x = sales, 
  i = purchases, 
  on = c("store", "date"), 
  roll = -Inf, 
  allow.cartesian = TRUE
)

    purchase_id sale_id
1:           10       1
2:           20       2
3:           30      NA
4:           40       4
5:           50       5
6:           60       6
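
Whichever implementation is used, the requirements stated in the question can be checked mechanically on any result that has purchase_id and sale_id columns. The sketch below is one possible check; check_mapping is a hypothetical helper name, and it assumes that purchase_id and sale_id are each unique across the whole table.

```r
library(data.table)

purchases <- data.table(
  purchase_id = c(10, 20, 30, 40, 50, 60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 2, 3, 3, 3)
)
sales <- data.table(
  sale_id = c(1, 2, 3, 4, 5, 6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1, 1, 1, 3, 3, 4)
)

# Hypothetical helper: TRUE if the mapping meets all three requirements.
# Assumes purchase_id and sale_id are globally unique.
check_mapping <- function(mapping, purchases, sales) {
  # 1. Every purchase appears exactly once
  ok_purchases <- identical(sort(mapping$purchase_id), sort(purchases$purchase_id))
  # 2. No sale is used more than once
  used <- mapping$sale_id[!is.na(mapping$sale_id)]
  ok_sales <- !anyDuplicated(used)
  # 3. Matched pairs share a store and the sale is not earlier than the purchase
  m <- merge(mapping[!is.na(sale_id)], purchases, by = "purchase_id")
  m <- merge(m, sales, by = "sale_id", suffixes = c(".p", ".s"))
  ok_pairs <- all(m$store.p == m$store.s & m$date.s >= m$date.p)
  ok_purchases && ok_sales && ok_pairs
}

good <- data.table(purchase_id = c(10, 20, 30, 40, 50, 60),
                   sale_id = c(1, 2, NA, 4, 5, 6))
reused <- data.table(purchase_id = c(10, 20, 30, 40, 50, 60),
                     sale_id = c(1, 1, NA, 4, 5, 6))   # sale 1 used twice
too_early <- data.table(purchase_id = c(10, 20, 30, 40, 50, 60),
                        sale_id = c(1, 2, 3, 4, 5, 6)) # sale 3 predates purchase 30

check_mapping(good, purchases, sales)
check_mapping(reused, purchases, sales)
check_mapping(too_early, purchases, sales)
```

The check only validates a mapping; it does not say whether the mapping is the one the greedy algorithm would produce.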

Uwe*_*Uwe 5

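According to the OP, the goal is

"to map each purchase to a sale that happened at the same time or later (and in the same store). The catch is that each purchase should map to exactly one sale or to none, and vice versa."

If I understand correctly, the OP wants to align the vector of purchase ids with the vector of sale ids after removing the sales that happened before the purchase (for each store).

Here is an approach that uses a non-equi join and rowid() to pick the aligned rows: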

library(data.table)
sales[purchases, on = c("store", "date>=date"), 
  .(store, purchase_id, sale_id = sale_id[x.date >= i.date])][
    rowid(store, purchase_id) == rowid(store, sale_id)]

Result for the modified use case (extended to cover more edge cases, e.g., more stores):

   store purchase_id sale_id
1:     a          10       1
2:     a          20       2
3:     a          30      NA
4:     b          40       5
5:     b          50       6
6:     b          60       7
7:     d          70      NA

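Note that store is included for safety and completeness, because purchase_id and sale_id may not be unique across all stores.

Also note that the result depends heavily on the row order in purchases and sales.

Data

The sample data were modified to cover more edge cases: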

purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60,70),
  store = c("a", "a", "a", "b", "b", "b", "d"),
  date = c(1,1,2,3,3,3,3)
)

sales <- data.table(
  sale_id = c(1,2,3,4,5,6,7,8),
  store = c("a", "a", "a", "b", "b", "b", "b", "c"),
  date = c(1,1,1,2,3,3,4,5)
)

purchases
   purchase_id store date
1:          10     a    1
2:          20     a    1
3:          30     a    2
4:          40     b    3
5:          50     b    3
6:          60     b    3
7:          70     d    3

includes an additional purchase in store d.

sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    2
5:       5     b    3
6:       6     b    3
7:       7     b    4
8:       8     c    5

includes 2 additional sales (rows 4 and 8) and an additional store c.

Explanation

The first expression

sales[purchases, on = c("store", "date>=date"), 
  .(store, purchase_id, sale_id = sale_id[x.date >= i.date])]

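returns all possible combinations of purchase_id with valid sale_ids, i.e., only those sale_ids whose sale date x.date is on or after the purchase date i.date (for each store). Purchases without any valid sale are kept with sale_id set to NA: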

    store purchase_id sale_id
 1:     a          10       1
 2:     a          10       2
 3:     a          10       3
 4:     a          20       1
 5:     a          20       2
 6:     a          20       3
 7:     a          30      NA
 8:     b          40       5
 9:     b          40       6
10:     b          40       7
11:     b          50       5
12:     b          50       6
13:     b          50       7
14:     b          60       5
15:     b          60       6
16:     b          60       7
17:     d          70      NA

The second expression

[rowid(store, purchase_id) == rowid(store, sale_id)]

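creates id numbers for each unique value of purchase_id and, likewise, for each unique value of sale_id (both within store), and subsets the rows by matching these id numbers.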
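
To make the rowid() comparison concrete, the small sketch below recomputes both id numbers for the store a rows of the intermediate result shown above. The column names p_n and s_n are only illustrative; the original expression does not create them.

```r
library(data.table)

# Candidate pairs for store "a", copied from the intermediate result above
cand <- data.table(
  store = "a",
  purchase_id = c(10, 10, 10, 20, 20, 20, 30),
  sale_id = c(1, 2, 3, 1, 2, 3, NA)
)

# Running count of each (store, purchase_id) and each (store, sale_id) combination
cand[, p_n := rowid(store, purchase_id)]
cand[, s_n := rowid(store, sale_id)]
print(cand)
# p_n is 1, 2, 3, 1, 2, 3, 1 and s_n is 1, 1, 1, 2, 2, 2, 1

# Keep the rows where both counters agree: (10, 1), (20, 2), (30, NA)
aligned <- cand[p_n == s_n, .(store, purchase_id, sale_id)]
print(aligned)
```

For purchase 10 the counters agree on its first candidate (sale 1), and for purchase 20 they agree on its second candidate (sale 2), which is how each sale ends up being used only once in this example.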