Slow performance of data.table multi-column non-equi joins when aggregating

Eth*_*han 5 join r multiple-columns data.table

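I am trying to track down a performance problem and have largely isolated it to a multi-column non-equi join. Below is a reasonable (but not exact) example of what I am trying to do, along with timings.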

library(quantmod)
library(data.table)

p <- last(OHLC(getSymbols("SPY", auto.assign = FALSE)), 700)
d <- as.data.table(p) #convert to a data.table for processing
d[, index := as.POSIXct(index)] #to match my use case. leaving as Date does not significantly alter timings
setnames(d, c("index", "Open", "High", "Low", "Close"))

# create partitions for analysis
partitions = unique(d[d, .(Top = x.Close, Bot = i.Close, Start = pmin(x.index, i.index)),
    on = .(Close >= Close), allow.cartesian = T][!is.na(Start)])

#desired analysis
system.time(r1 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
    on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI])
#7.67

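Joins on the same dataset with fewer constraints are much faster (though they do not produce the desired result). They are shown here purely for timing comparison: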

system.time(r2 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
    on = .(Close >= Bot, Close <= Top), allow.cartesian = T, by = .EACHI])
#4.4
system.time(r4 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
    on = .(index >= Start), allow.cartesian = T, by = .EACHI])
#4.67

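I know I could speed this up by reducing the number of rows in the partitions table, but I have already gone as far as I can in that direction and it is still slow. I also understand that this requires materializing a very large join under the hood, but with fewer constraints the materialized join is even larger, so the relative performance still puzzles me.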
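To check the claim about materialized join sizes, one option is to count the matched rows per row of `i` with `.N` and `by = .EACHI`, with and without the extra constraint. The following is a minimal sketch on toy data; the tables `x` and `i` and their columns (`id`, `val`, `lo`, `hi`, `from`) are made up for illustration and are not the SPY data from the question.

```r
library(data.table)

# Toy data (hypothetical, not the SPY series from the question)
x <- data.table(id = 1:6, val = c(1, 2, 3, 4, 5, 6))
i <- data.table(lo = c(2, 4), hi = c(5, 6), from = c(3L, 1L))

# Rows of x matched by each row of i using the two value constraints only
n_two <- x[i, .N, on = .(val >= lo, val <= hi), by = .EACHI]

# Rows of x matched by each row of i after adding a third constraint
n_three <- x[i, .N, on = .(val >= lo, val <= hi, id >= from), by = .EACHI]

# Total matched rows: adding a constraint can only keep or shrink this count
sum(n_two$N)
sum(n_three$N)
```

Running the same two counts against `d` and `partitions` would show how much smaller the three-constraint join really is, which helps separate join cost from the cost of evaluating `j`.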

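Am I doing something wrong? I really do not understand why adding another column condition causes such a sharp slowdown. Any suggestions on how to make it faster?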

Edit 7/30/18

So, after trying the very helpful verbose = T option, I found another aspect of the problem: median() appears to be very slow in this context.

First, the existing analysis using mean(), with verbose output:

r1 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
    on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI, verbose = T]

Non-equi join operators detected ... 
  forder took ... 0.000sec 
  Generating non-equi group ids ... done in 0.000sec 
  Recomputing forder with non-equi ids ... done in 0.000sec 
  Found 26 non-equi group(s) ...
Starting bmerge ...done in 0.790sec 
Detected that j uses these columns: i.Top,i.Bot,i.Start,x.Close 
lapply optimization is on, j unchanged as 'list(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close))'
Old mean optimization changed j from 'list(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close))' to 'list(i.Top, i.Bot, i.Start, .External(Cfastmean, x.Close, FALSE), sd(x.Close))'
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 0.077s for 235273 groups
  eval(j) took 4.475s for 235273 calls
4.690sec 

Next, the same analysis using median() instead, again with verbose output:

r1 <- d[partitions, .(i.Top, i.Bot, i.Start, median(x.Close), sd(x.Close)),
    on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI, verbose = T]
Non-equi join operators detected ... 
  forder took ... 0.000sec 
  Generating non-equi group ids ... done in 0.000sec 
  Recomputing forder with non-equi ids ... done in 0.000sec 
  Found 26 non-equi group(s) ...
Starting bmerge ...done in 0.810sec 
Detected that j uses these columns: i.Top,i.Bot,i.Start,x.Close 
lapply optimization is on, j unchanged as 'list(i.Top, i.Bot, i.Start, median(x.Close), sd(x.Close))'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 0.079s for 235273 groups
  eval(j) took 12.826s for 235273 calls
13.1sec 

For reference:

> getOption("datatable.optimize")
[1] Inf

So I suppose the other question is: is there any way to speed up median() calls in the context of a non-equi join with by = .EACHI?
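The verbose output above shows that mean() was rewritten to an internal fast mean while median() was left unchanged, so j calls base R's median() once per group (235273 times). A rough way to gauge that per-call cost outside the join is to time repeated calls on a small vector. This is only a sketch: the vector `v` and the call count `n_calls` are arbitrary stand-ins, and the loop measures plain base R mean() rather than data.table's optimized version.

```r
# Synthetic small vector standing in for one group's x.Close values
set.seed(1)
v <- rnorm(20)
n_calls <- 10000L

# Time many calls on the same small vector to expose per-call overhead
t_mean   <- system.time(for (k in seq_len(n_calls)) mean(v))[["elapsed"]]
t_median <- system.time(for (k in seq_len(n_calls)) median(v))[["elapsed"]]

# Elapsed seconds for each function; values depend on the machine
c(mean = t_mean, median = t_median)
```

If median() turns out noticeably slower per call here as well, the gap in eval(j) is likely dominated by function-call overhead on many small groups rather than by the join itself.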