Relative-window running sum via a data.table non-equi join

Question

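I have a data set with the columns customerID, transactionDate, productId and purchaseQty loaded into a data.table. For each row, I want to calculate the sum and the mean of purchaseQty over the preceding 45 days.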

        productId customerID transactionDate purchaseQty
 1:    870826    1186951      2016-03-28      162000
 2:    870826    1244216      2016-03-31        5000
 3:    870826    1244216      2016-04-08        6500
 4:    870826    1308671      2016-03-28      221367
 5:    870826    1308671      2016-03-29       83633
 6:    870826    1308671      2016-11-29       60500


I'm looking for output like this:

    productId customerID transactionDate purchaseQty    sumWindowPurchases
 1:    870826    1186951      2016-03-28      162000                162000
 2:    870826    1244216      2016-03-31        5000                  5000
 3:    870826    1244216      2016-04-08        6500                 11500
 4:    870826    1308671      2016-03-28      221367                221367
 5:    870826    1308671      2016-03-29       83633                305000
 6:    870826    1308671      2016-11-29       60500                 60500


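So sumWindowPurchases contains the sum of purchaseQty for the customer/product within a 45-day window ending at the current transaction date. Once I have that working, adding the mean and the other calculations I need should be trivial.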
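To pin down the intended semantics, here is a minimal brute-force sketch in R. It rebuilds the sample rows shown above and, for each row, sums purchaseQty over the same product/customer where the date falls in [transactionDate - 45, transactionDate]. The inclusive window bounds are an assumption based on the expected output; this is only a reference to check other approaches against, not an efficient solution.

```r
library(data.table)

# Sample data copied from the table in the question
DT <- data.table(
  productId       = 870826L,
  customerID      = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
  transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                              "2016-03-28", "2016-03-29", "2016-11-29")),
  purchaseQty     = c(162000, 5000, 6500, 221367, 83633, 60500)
)

# Brute-force reference: within each product/customer group, loop over the
# rows and sum the quantities whose date lies in [date - 45, date].
DT[, sumWindowPurchases := sapply(seq_len(.N), function(i) {
  in_window <- transactionDate <= transactionDate[i] &
               transactionDate >= transactionDate[i] - 45
  sum(purchaseQty[in_window])
}), by = .(productId, customerID)]

print(DT)
```

This is quadratic in the group size, so it only serves as a check on small data.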

I went back to my SQL roots and thought of a self-join:

select   DT.customerId, DT.transactionDate, DT.productId, sum(DT1.purchaseQty)
from     DT
         inner join DT as DT1 on 
             DT.customerId = DT1.customerId
             and DT.productId =  DT1.productId
             and DT1.transactionDate between dateadd(day, -45, DT.transactionDate) and DT.transactionDate
group by DT.customerId, DT.transactionDate, DT.productId


Trying to translate that into R using data.table syntax, I was hoping to do something like this:

DT1 <- DT #alias. have confirmed this is just a pointer
DT[DT1[DT1$transactionDate >= DT$transactionDate - 45],
   .(sum(DT1$purchaseQty)), 
   by = .(DT$customerId , DT$transactionDate ), 
   on = .(customerId , DT1$transactionDate <= DT$TransactionDate), 
   allow.cartesian = TRUE]


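I guess I have two questions. First, what is the "R way" to do this? Is a data.table self-join the right approach, or would I be better off trying to use the Reduce function?

I suspect the self-join is the only way to get the rolling 45-day window in there. So the second part is that I need some help with the data.table syntax for explicitly referencing which source table a column comes from, since it is a self-join and both sides have the same column names.

I have been studying a related answer from Frank and came up with this expression: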

DT[.(p = productId, c = customerID, t = transactionDate, start = transactionDate - 45),
        on = .(productId==p, customerID==c, transactionDate<=t, transactionDate>=start),
        allow.cartesian = TRUE, nomatch = 0]


which produces this output:

   productId customerID transactionDate purchaseQty transactionDate.1
1:    870826    1186951      2016-03-28      162000        2016-02-12
2:    870826    1244216      2016-03-31        5000        2016-02-15
3:    870826    1244216      2016-04-08        5000        2016-02-23
4:    870826    1244216      2016-04-08        6500        2016-02-23
5:    870826    1308671      2016-03-28      221367        2016-02-12
6:    870826    1308671      2016-03-29      221367        2016-02-13
7:    870826    1308671      2016-03-29       83633        2016-02-13
8:    870826    1308671      2016-11-29       60500        2016-10-15


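This is very close to what I need for the final step. If I could sum the purchase quantities in this output, grouped by customer/product/transactionDate.1, I would have something useful. However, I can't get the syntax right, and I don't understand where the transactionDate.1 name comes from.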

Answer 1

Eth*_*han 1

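This also works and is arguably simpler. It has the advantages of not requiring a sorted input set and of having fewer dependencies.

I still don't understand why it produces two transactionDate columns in the output. It seems to be a by-product of the "on" clause. In fact, the output columns and their order appear to be every element of the on clause, without their aliases, followed by the sum.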

DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),
    on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
    .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
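For completeness, here is a self-contained sketch of the same join run against the sample rows from the question, extended with the mean and copied back onto DT. The names res and windowMean are introduced only for this illustration. As far as I understand data.table's non-equi joins, each column used in the on clause is reported with the value from i rather than from DT, which would explain the two date columns: with this on clause, the third column holds tmax and the fourth holds tmin. Treat that reading as an assumption and check it on your data.table version.

```r
library(data.table)

# Sample data copied from the table in the question
DT <- data.table(
  productId       = 870826L,
  customerID      = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
  transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                              "2016-03-28", "2016-03-29", "2016-11-29")),
  purchaseQty     = c(162000, 5000, 6500, 221367, 83633, 60500)
)

# i has one row per row of DT; by = .EACHI evaluates j once per row of i,
# so res comes back with one row per transaction, in the order of DT.
res <- DT[.(p = productId, c = customerID,
            tmin = transactionDate - 45, tmax = transactionDate),
          on = .(productId == p, customerID == c,
                 transactionDate <= tmax, transactionDate >= tmin),
          .(windowSum = sum(purchaseQty), windowMean = mean(purchaseQty)),
          by = .EACHI, nomatch = 0]

# Columns 3 and 4 of res both stem from transactionDate and carry the window
# bounds from i (tmax, then tmin), so pick the aggregates by name instead.
DT[, c("windowSum", "windowMean") := .(res$windowSum, res$windowMean)]

print(DT)
```

Copying by position like this relies on every row matching at least itself, so that nomatch = 0 drops nothing and res lines up with DT row for row.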

