Uwe 7 join r self-join data.table
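While preparing an answer to the question "dplyr or data.table to calculate time series aggregations in R", I noticed that I get different results depending on whether the table is updated in place or returned as a new object. I also get different results when I change the order of the columns in the non-equi join conditions.
At present, I have no explanation for this. It may be due to a major misunderstanding on my side or to a simple coding error.
Please note that this question asks specifically for an explanation of the observed behaviour of data.table joins. If you have an alternative solution to the underlying problem, please feel free to post it as an answer to the original question.
The original question was how to count, for each patient, the number of hospitalizations in the 365 days before a hospitalization (including the hospitalization itself), using these data: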
library(data.table) # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7",
"2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0
   patient.id hospitalization.date
1:          1           2013-10-15
2:          1           2015-07-16
3:          1           2016-01-07
4:          2           2014-10-15
5:          2           2015-12-20
6:          2           2015-12-25
7:          2           2016-02-10
The code below gives the expected answer (additional helper columns are added here for clarity):
# add helper columns
DT0[, start.date := hospitalization.date - 365][
, end.date := hospitalization.date][]
DT0
   patient.id hospitalization.date start.date   end.date
1:          1           2013-10-15 2012-10-15 2013-10-15
2:          1           2015-07-16 2014-07-16 2015-07-16
3:          1           2016-01-07 2015-01-07 2016-01-07
4:          2           2014-10-15 2013-10-15 2014-10-15
5:          2           2015-12-20 2014-12-20 2015-12-20
6:          2           2015-12-25 2014-12-25 2015-12-25
7:          2           2016-02-10 2015-02-10 2016-02-10
result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
   patient.id hospitalization.date hospitalization.date hospitalizations.last.year
1:          1           2012-10-15           2013-10-15                          1
2:          1           2014-07-16           2015-07-16                          1
3:          1           2015-01-07           2016-01-07                          2
4:          2           2013-10-15           2014-10-15                          1
5:          2           2014-12-20           2015-12-20                          1
6:          2           2014-12-25           2015-12-25                          2
7:          2           2015-02-10           2016-02-10                          3
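This is the expected result, apart from the renamed and duplicated column names (left as is for comparison).
For patient.id == 2, the result in the last row is 3 because on 2016-02-10 the patient was hospitalized for the third time since 2015-02-10.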
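As a sanity check on these expected counts, the same numbers can be reproduced by brute force without any join. This is only a sketch in base R; the helper count_last_year is made up for illustration and is not part of data.table.

```r
# Same sample data as above, kept as plain vectors (base R only)
patient.id <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L)
hospitalization.date <- as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                                  "2014-10-15", "2015-12-20", "2015-12-25",
                                  "2016-02-10"))

# Hypothetical helper: for row k, count the hospitalizations of the same
# patient that fall into the window [date - 365, date]
count_last_year <- function(k) {
  in.window <- patient.id == patient.id[k] &
    hospitalization.date >= hospitalization.date[k] - 365 &
    hospitalization.date <= hospitalization.date[k]
  sum(in.window)
}

expected <- vapply(seq_along(patient.id), count_last_year, integer(1))
expected
```

The vector comes out as 1 1 2 1 1 2 3, which agrees with the hospitalizations.last.year column of the join result.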
result is a new data.table object which occupies additional memory. So I tried to update the original data.table object in place using:
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          2
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          3
7:          2           2016-02-10 2015-02-10 2016-02-10                          3
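DT has now been updated in place, but rows 5 and 6 now show 3 hospitalizations instead of 1 and 2, respectively (row 2 has also changed, from 1 to 2). It seems that every row now receives the total number of hospitalizations of the last period it falls into.
In addition, the order of the columns in the non-equi join conditions matters, even in a self-join: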
result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
My expectation was that "start.date<=hospitalization.date" would be equivalent to "hospitalization.date>=start.date" (note that < and > are switched as well), but the result
   patient.id start.date   end.date hospitalizations.last.year
1:          1 2013-10-15 2013-10-15                          1
2:          1 2015-07-16 2015-07-16                          2
3:          1 2016-01-07 2016-01-07                          1
4:          2 2014-10-15 2014-10-15                          1
5:          2 2015-12-20 2015-12-20                          3
6:          2 2015-12-25 2015-12-25                          2
7:          2 2016-02-10 2016-02-10                          1
is different. It seems that the upcoming hospitalizations are now being counted.
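One way to read the difference, assuming data.table's convention that in each on condition the name to the left of the operator refers to a column of x (the outer table) and the name to the right refers to a column of i: the swapped conditions ask, for each row of i, which windows in x contain its hospitalization date. The brute-force sketch below spells out that reading in base R; the helper count_windows_containing is hypothetical.

```r
# Same sample data as above, kept as plain vectors (base R only)
patient.id <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L)
hospitalization.date <- as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                                  "2014-10-15", "2015-12-20", "2015-12-25",
                                  "2016-02-10"))
start.date <- hospitalization.date - 365
end.date   <- hospitalization.date

# Hypothetical helper: for row k (playing the role of i), count the rows
# (playing the role of x) of the same patient whose window
# [start.date, end.date] contains the hospitalization date of row k
count_windows_containing <- function(k) {
  sum(patient.id == patient.id[k] &
        start.date <= hospitalization.date[k] &
        end.date   >= hospitalization.date[k])
}

swapped <- vapply(seq_along(patient.id), count_windows_containing, integer(1))
swapped
```

Under this reading the counts are 1 2 1 1 3 2 1, the same as in the output above: each row counts itself plus the later hospitalizations of the same patient within the following 365 days.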
Interestingly, the in-place update now does return the same result (except for some column names):
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          1
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          2
7:          2           2016-02-10 2015-02-10 2016-02-10                          1
There is a possibly related question which led to an issue being reported on GitHub.
Grouping with by=.EACHI means "by each i", not "by each x".
# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")
z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
.(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]
      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 *
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *
For patient 1, the groups are
.(start.date = 2012-10-15, end.date = 2013-10-15), count 1
.(start.date = 2014-07-16, end.date = 2015-07-16), count 1
.(start.date = 2015-01-07, end.date = 2016-01-07), count 2
Fortunately, there are 7 groups in this join and 7 rows in the original table.
For the trickier issue, I'll borrow an example from my notes:
Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13
With verbose on, we see a helpful message about assignment "to 3 row subset of 2 rows."
-- modified from "Quick R Tutorial", section "Updating in a join"
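One way to spot such silent overwrites beforehand is to count the matches for each row of the table that is about to be updated. The following is a minimal sketch that reuses a and b from the quoted example; the names matches and multi are only illustrative.

```r
library(data.table)

a <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
b <- data.table(id = 1:2, y = c(11L, 15L))

# For each row of b (the table that would be updated), count matching rows in a
matches <- a[b, on = .(id), .N, by = .EACHI]

# Rows of b with more than one match: an update join b[a, x := i.x] can keep
# only one of the competing values for these ids
multi <- matches[N > 1L]
multi
```

Here multi contains a single row, id 1 with N equal to 2, which is the id whose value was overwritten in the example.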
In the OP's case, verbose=TRUE unfortunately offers no such message:
DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0 secs
# Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none>
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.000s for 7 groups
# eval(j) took 0.000s for 7 calls
# 0.01 secs
However, we can see that the last row of each x group does contain the values the OP sees. I have manually marked these rows with asterisks above. Alternatively, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].
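The same check can be done programmatically. The sketch below rebuilds the sample data with the shortened column name h.date used above and assumes, as in the printed output, that the joined rows come out ordered by i group; the name last.per.x is illustrative.

```r
library(data.table)

DT <- data.table(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  h.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                     "2014-10-15", "2015-12-20", "2015-12-25", "2016-02-10")))
DT[, `:=`(start.date = h.date - 365, end.date = h.date)]

# One row per (i group, matching x row); N is the size of the i group
z <- DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
        .(x.h.date, N = .N), by = .EACHI]

# For each x row, keep the count of the last i group that matched it
last.per.x <- z[, .(n = N[.N]), by = x.h.date]
last.per.x
```

The n column comes out as 1, 2, 2, 1, 3, 3, 3, which are the values seen after the in-place update.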
For reference, the update join here can be written as:
DT[, n :=
.SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N
]
   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3
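This is the correct/idiomatic way to handle this case, where a column is added to x by looking up each row of x in another table and computing a summary of the matches: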
x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
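To make the template concrete, here is a small sketch with made-up tables: x holds one row per customer and DT2 holds their orders. All table and column names are hypothetical and chosen only for illustration.

```r
library(data.table)

x   <- data.table(id = 1:2)
DT2 <- data.table(id = c(1L, 1L, 2L), amount = c(10, 5, 7))

# For each row of x (passed in as .SD), look up its matches in DT2,
# summarise them, and assign the one-value-per-row result back to x
x[, total := DT2[.SD, on = .(id), sum(amount), by = .EACHI]$V1]
x
```

As long as j returns a single value per group, the inner join yields one row per row of x, in the order of x, so the assignment lines up row by row and no match can silently overwrite another.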