I have a main table that contains the dates of the main events for each personid:
dfMain <- data.frame(last = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
personid = c(12341, 122345, 12341, 12341, 122345),
diff = c(NA, NA, 4, 28, 32))
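(An NA in the "previous" and "diff" variables means that this is the person's first "main event", i.e. there is no previous date and no time difference.)
I also have a secondary table that contains the "secondary events" for each personid: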
dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
personid = c(122345, 122345, 12341, 122345, 12341))
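My question is: what is the best way (given the volume of my data) to augment my "dfMain" data frame with the number of unique secondary events that fall between each personid's main-event dates, i.e. strictly between "previous" and "last"?
In the dummy example, my goal is to obtain this table: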
Occurances <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)
Using the data.table package:
# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]
# create a reference
dfSecondary <- dfSecondary[dfMain
, on = .(personid, date > previous, date < last)
, .(dates = x.date)
, by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))
# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
, by = .(personid, previous, last)]
, on = .(personid, previous, last)
, Occ := V1][]
This gives:
         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3
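To sanity-check these counts on a small sample, a plain base R loop can serve as a reference. This is only a minimal sketch and not the approach to use on large data: it rebuilds the example data under the hypothetical names `mainChk` and `secChk` (so it does not depend on the objects modified above), and it assumes that "between" means strictly after `previous` and strictly before `last`, and that rows without a previous date should stay NA.

```r
# Rebuild the example data with real Date columns (names are illustrative only)
mainChk <- data.frame(
  last     = as.Date(c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02")),
  previous = as.Date(c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01")),
  personid = c(12341, 122345, 12341, 12341, 122345)
)
secChk <- data.frame(
  date     = as.Date(c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02")),
  personid = c(122345, 122345, 12341, 122345, 12341)
)

# For each main-event row, count the distinct secondary dates of the same
# person that fall strictly between 'previous' and 'last'
occ_check <- vapply(seq_len(nrow(mainChk)), function(i) {
  prev <- mainChk$previous[i]
  if (is.na(prev)) return(NA_integer_)   # first main event: no interval to count in
  d <- secChk$date[secChk$personid == mainChk$personid[i]]
  length(unique(d[d > prev & d < mainChk$last[i]]))
}, integer(1))

occ_check
# NA NA 2 0 3
```

The loop scans the secondary table once per main-event row, so it is only meant for verifying a few rows; unlike the join above it also collapses duplicate dates via `unique()`, which makes no difference on this example.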