How to do a BETWEEN-style merge the data.table way?

Mer*_*rik 3 performance r data.table

I have two data.tables, each 5-10 GB in size. They look similar to the following.

library(data.table)
A <- data.table(
  person = c(1,1,1,2,3,3,3,3,4,4),
  datetime = c(
    '2015-04-06 14:22:18',
    '2015-04-07 02:55:32',
    '2015-11-21 10:16:05',
    '2015-10-03 13:37:29',
    '2015-02-26 23:51:56',
    '2015-05-16 18:21:44',
    '2015-06-02 04:07:43',
    '2015-11-28 15:22:36',
    '2015-01-19 04:10:22',
    '2015-01-24 02:18:11'
  )
)

B <- data.table(
  person = c(1,1,3,4,4,5),
  datetime2 = c(
    '2015-04-06 14:24:59',
    '2015-11-28 15:22:36',
    '2015-06-02 04:07:43',
    '2015-01-19 06:10:22',
    '2015-01-24 02:18:18',
    '2015-04-06 14:22:18'
  )
)

A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)

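The idea is to find rows in B whose datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how to do this most efficiently with data.table.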

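One plan is to join the two data.tables on person only, then compute the time difference and keep the rows where it is between 0 and 600 seconds, and finally join those rows back to A: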

setkey(A,person)
AB <- A[B,.(datetime,
            datetime2,
            diff = difftime(datetime2, datetime, units = "secs"))
        , by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]

This gives the correct result:

    person            datetime           datetime2     diff
 1:      1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
 2:      1 2015-04-07 02:55:32                <NA>  NA secs
 3:      1 2015-11-21 10:16:05                <NA>  NA secs
 4:      2 2015-10-03 13:37:29                <NA>  NA secs
 5:      3 2015-02-26 23:51:56                <NA>  NA secs
 6:      3 2015-05-16 18:21:44                <NA>  NA secs
 7:      3 2015-06-02 04:07:43                <NA>  NA secs
 8:      3 2015-11-28 15:22:36                <NA>  NA secs
 9:      4 2015-01-19 04:10:22                <NA>  NA secs
10:      4 2015-01-24 02:18:11 2015-01-24 02:18:18   7 secs

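However, I am not sure this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0], which I assume runs a vector scan rather than a binary search, but I cannot think of how to do it with a binary search.

Also, I am not sure whether converting to POSIXct is the most efficient way to compute time differences.

Any ideas on how to improve efficiency are highly appreciated.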

Mak*_*duk 5

data.table's rolling join is a perfect fit for this task:

B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]

   person           datetime2   datetime
 1:      1 2015-04-06 14:24:59 1428319338
 2:      1                  NA 1428364532
 3:      1                  NA 1448090165
 4:      2                  NA 1443868649
 5:      3                  NA 1424983916
 6:      3                  NA 1431789704
 7:      3 2015-06-02 04:07:43 1433207263
 8:      3                  NA 1448713356
 9:      4                  NA 1421629822
10:      4 2015-01-24 02:18:18 1422055091

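The only difference from your expected output is that it checks the time difference as less than or equal to 10 minutes (<=), and it also keeps exact matches with a difference of 0 seconds (see row 7). If that is a problem for you, you can simply drop the equal matches afterwards.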
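A minimal sketch of that clean-up step, assuming you want the strict bounds from the question (0 < diff < 600). It uses a three-row subset of the question's data, and the `res` and `diff` names and the explicit `tz = "UTC"` are my own choices for illustration, not part of the answer above:

```r
library(data.table)

# Subset of the question's data; UTC is set explicitly (an assumption)
# so the result does not depend on the local time zone.
A <- data.table(
  person   = c(1, 3, 4),
  datetime = as.POSIXct(c("2015-04-06 14:22:18",
                          "2015-06-02 04:07:43",
                          "2015-01-24 02:18:11"), tz = "UTC")
)
B <- data.table(
  person    = c(1, 3, 4),
  datetime2 = as.POSIXct(c("2015-04-06 14:24:59",
                           "2015-06-02 04:07:43",
                           "2015-01-24 02:18:18"), tz = "UTC")
)

B[, datetime := datetime2]   # join column; datetime2 stays as the payload
setkey(A, person, datetime)
setkey(B, person, datetime)

res <- B[A, roll = -600]     # rolling join, as in the answer

# Time difference in seconds; NA where the join found no match
res[, diff := as.numeric(datetime2) - as.numeric(datetime)]

# Blank out matches that sit exactly on either boundary
res[diff <= 0 | diff >= 600,
    `:=`(datetime2 = as.POSIXct(NA), diff = NA_real_)]

print(res)
```

With this subset, person 3 matches exactly (0 seconds) in the rolling join and is blanked out by the last step, while persons 1 and 4 keep their matches. The filter runs on the already-joined result, which has one row per row of A, so it should be much cheaper than scanning the full person-only join.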

  • No need to unclass. Using `POSIXct` directly works fine (since it is a numeric with attributes, as eddi mentioned). (2 upvotes)
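To illustrate the point in that comment, here is a small base R check that a `POSIXct` value is stored as a double with class attributes, so plain arithmetic on it is cheap. The variable names `x` and `y` and the explicit `tz = "UTC"` are just for illustration:

```r
x <- as.POSIXct("2015-04-06 14:22:18", tz = "UTC")
y <- as.POSIXct("2015-04-06 14:24:59", tz = "UTC")

typeof(x)    # "double": seconds since 1970-01-01 UTC
class(x)     # "POSIXct" "POSIXt"

# The difference in seconds is ordinary numeric subtraction
secs <- as.numeric(y) - as.numeric(x)
secs         # 161
```

So the conversion to `POSIXct` in the question is probably not the bottleneck; it only needs to happen once per column.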