Mer*_*rik 3 performance r data.table
I have two data.tables, each 5-10 GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
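The idea is to find the rows in B whose datetime falls within 0-10 minutes of a matching row in A (rows are matched by person) and to flag them in A. The question is how to do this most efficiently with data.table.
One plan is to join the two data.tables on person only, then compute the time difference and keep the rows where it is between 0 and 600 seconds, and finally join that result back to A: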
setkey(A,person)
AB <- A[B,.(datetime,
datetime2,
diff = difftime(datetime2, datetime, units = "secs"))
, by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]
This gives the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
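However, I am not sure this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0], which I assume runs a vector scan rather than a binary search, but I cannot think of how to do it with a binary search.
Also, I am not sure whether converting to POSIXct is the most efficient way to compute the time differences.
Any ideas on how to make this more efficient are highly appreciated.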
data.table's rolling joins are a perfect fit for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
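Since the goal was to flag the matches in A itself, here is a minimal sketch of one way to carry the rolling-join result back into A. The small sample data and the column names `diff_secs` and `matched` are made up for illustration, and `mult = "first"` is an assumption that one match per row of A is enough:

```r
library(data.table)

# Small made-up sample in the same shape as A and B above.
A <- data.table(
  person   = c(1, 1, 4),
  datetime = as.POSIXct(c('2015-04-06 14:22:18',
                          '2015-04-07 02:55:32',
                          '2015-01-24 02:18:11'), tz = "UTC")
)
B <- data.table(
  person    = c(1, 4),
  datetime2 = as.POSIXct(c('2015-04-06 14:24:59',
                           '2015-01-24 02:18:18'), tz = "UTC")
)

B[, datetime := datetime2]
setkey(A, person, datetime)
setkey(B, person, datetime)

# which = TRUE returns, for each row of A, the row number of its match in B
# (NA when no row of B lies within the 600-second window).
idx <- B[A, roll = -600, which = TRUE, mult = "first"]

# Add the matched timestamp, the difference in seconds and a flag to A by reference.
A[, datetime2 := B$datetime2[idx]]
A[, diff_secs := as.numeric(datetime2) - as.numeric(datetime)]
A[, matched := !is.na(idx)]
```

Subtracting the numeric values of the two POSIXct columns gives the difference in seconds directly, without going through difftime.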
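The only difference between this rolling-join result and your expected output is that it checks for a time difference less than or equal to 10 minutes (<=), so exact matches are included too (see row 7, where both timestamps are identical). If that does not suit you, you can simply remove the exact matches.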
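Removing the exact matches could look something like the sketch below. The sample rows are a made-up subset of the data above and `res` is just an illustrative name; it blanks out `datetime2` wherever the two timestamps are identical:

```r
library(data.table)

# Made-up subset of the data above; person 3 has an exact timestamp match.
A <- data.table(
  person   = c(1, 3, 4),
  datetime = as.POSIXct(c('2015-04-06 14:22:18',
                          '2015-06-02 04:07:43',
                          '2015-01-24 02:18:11'), tz = "UTC")
)
B <- data.table(
  person    = c(1, 3, 4),
  datetime2 = as.POSIXct(c('2015-04-06 14:24:59',
                           '2015-06-02 04:07:43',
                           '2015-01-24 02:18:18'), tz = "UTC")
)

B[, datetime := datetime2]
setkey(A, person, datetime)
setkey(B, person, datetime)

res <- B[A, roll = -600]

# Treat a difference of exactly 0 seconds as "no match".
res[datetime2 == datetime, datetime2 := NA]
```

Note that this only handles the exact-match case; if a row of B exists both at the same instant and a few minutes later, the later one is not picked up by this sketch.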