Conditionally count the number of unique dates per ID between two dates

Vas*_*iou 4 r date dataframe

I have a main table containing the dates of the main events for each personid:

dfMain <- data.frame(last     = c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02"),
                     previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
                     personid = c(12341, 122345, 12341, 12341, 122345),
                     diff     = c(NA, NA, 4, 28, 32))

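(An NA in the "previous" and "diff" variables means that this is the person's first "main event", i.e. there is no previous date and no time difference.)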

I also have a secondary table containing the "secondary events" for each personid:

dfSecondary <- data.frame(date     = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
                          personid = c(122345, 122345, 12341, 122345, 12341))

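My question is: what is the most efficient way (given the size of my data) to augment my "dfMain" data frame with the number of unique secondary events that fall between the main event dates for each personid?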

In the dummy example, the table I am aiming for is:

Occurances  <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)

Jaa*_*aap 5

Using the data.table package:

# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]

# non-equi join: for each row of dfMain, collect the secondary dates strictly between 'previous' and 'last'
dfSecondary <- dfSecondary[dfMain
                           , on = .(personid, date > previous, date < last)
                           , .(dates = x.date)
                           , by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))

# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
                                        , by = .(personid, previous, last)]
       , on = .(personid, previous, last)
       , Occ := V1][]

This gives:

         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3
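To sanity-check the result on a small sample, a slow but transparent base R version can help. The following is only a sketch under stated assumptions: the names `main_ref`, `secondary_ref` and `Occurrences` are made up for this illustration, the data is rebuilt from the question so the block runs on its own, and "between" is taken to mean strictly after `previous` and strictly before `last`, as in the join conditions above. Unlike `sum(!is.na(dates))`, which counts matching rows, it calls `unique()` so that a date repeated for the same person is counted once.

```r
# Rebuild the example data (hypothetical names, dates stored as Date)
main_ref <- data.frame(
  last     = as.Date(c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02")),
  previous = as.Date(c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01")),
  personid = c(12341, 122345, 12341, 12341, 122345)
)
secondary_ref <- data.frame(
  date     = as.Date(c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02")),
  personid = c(122345, 122345, 12341, 122345, 12341)
)

# For each main event, count the distinct secondary dates of the same person
# that lie strictly between 'previous' and 'last'; NA when there is no previous event
main_ref$Occurrences <- vapply(seq_len(nrow(main_ref)), function(i) {
  from <- main_ref$previous[i]
  to   <- main_ref$last[i]
  if (is.na(from) || is.na(to)) return(NA_integer_)
  d <- secondary_ref$date[secondary_ref$personid == main_ref$personid[i]]
  length(unique(d[d > from & d < to]))
}, integer(1))

main_ref$Occurrences
# NA NA 2 0 3
```

This loops over rows and rescans the secondary table each time, so it is not meant for large data; it is only a reference to compare against the data.table output.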