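I used to do my data wrangling with dplyr, but some computations are "slow", in particular subsetting by group. I read that dplyr is slow when there are a lot of groups and that, based on this benchmark, data.table could be faster, so I started learning data.table.
Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1, id2 and subset the rows with the max(datetime) for each group.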
# random datetime generation function by Dirk Eddelbuettel
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
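rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/05") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  # length of the interval in seconds
  dt <- as.numeric(difftime(et, st, units = "secs"))
  # N sorted random offsets within the interval
  ev <- sort(runif(N, 0, dt))
  # return the sorted random datetimes visibly
  st + ev
}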
set.seed(42)
# Create 230,000 (id1, id2) pairs
ids <- data.frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly repeat ids[1:2000, ] to create groups
ids <- …
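To make the goal concrete (keep, for each id1/id2 group, the rows where datetime equals the group maximum), here is a minimal sketch in data.table. It assumes the data.table package is installed. The `toy` table and its values are made up for illustration and are not the data generated above, and nothing here is a timing claim about the real 250k-row case.

```r
library(data.table)

# Hypothetical toy data, only to illustrate the grouping logic
toy <- data.table(
  id1 = c("a", "a", "b", "b", "b"),
  id2 = c("x", "x", "y", "y", "z"),
  datetime = as.POSIXct(c("2015-01-01 10:00:00", "2015-01-02 10:00:00",
                          "2015-03-01 08:00:00", "2015-02-01 08:00:00",
                          "2015-04-01 12:00:00"), tz = "UTC"),
  value = 1:5
)

# Option 1: subset .SD inside each group (readable, but builds .SD per group)
latest_sd <- toy[, .SD[datetime == max(datetime)], by = .(id1, id2)]

# Option 2: collect the matching row numbers with .I, then subset once
idx <- toy[, .I[datetime == max(datetime)], by = .(id1, id2)]$V1
latest_i <- toy[idx]

print(latest_i)
```

Both options keep every row tied for the maximum within a group. Option 2 is often suggested when there are many small groups because it avoids materialising `.SD` for each one, though whether it helps on a given dataset is something to measure rather than assume.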