findInterval() with right-closed intervals

Ken*_*ams 19 r binary-search

The great R function findInterval() uses left-closed sub-intervals in its vec argument, as shown in its docs:

if i <- findInterval(x,v), we have v[i[j]] <= x[j] < v[i[j] + 1]
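To make the documented inequality concrete, here is a small sketch (the values are made up for illustration) showing that a point sitting exactly on a break is assigned to the interval that starts at that break:

```r
v <- c(2, 5, 6, 35)
x <- c(5, 5.5, 6)

i <- findInterval(x, v)
i
# [1] 2 2 3

# The documented relation v[i[j]] <= x[j] < v[i[j] + 1] holds for every j
all(v[i] <= x & x < v[i + 1])
# [1] TRUE
```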

What are my options if I want right-closed sub-intervals? The best I've come up with is:

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}
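One caveat with this version, as far as I can tell: when some x lies below vec[1], findInterval() returns 0 for it, and vec[fi] silently drops zero indices, so the comparison x == vec[fi] is made between vectors of different lengths and can misalign. Below is a sketch of a guarded variant. The name findInterval.rightClosedSafe is hypothetical (not from the post), and it assumes x contains no missing values.

```r
# Hypothetical guarded variant; assumes x has no NA values
findInterval.rightClosedSafe <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  # Only compare against a break where one was actually found (fi > 0)
  hit <- fi > 0L
  hit[hit] <- x[hit] == vec[fi[hit]]
  # Step back one interval wherever x sits exactly on a break
  fi - hit
}

findInterval.rightClosedSafe(c(1, 2, 3, 6), c(2, 5, 6, 35))
# [1] 0 0 1 2
```

Here 1 and 2 both fall outside (2, 5], so they get 0, and 6 lands in (5, 6].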

Another that also works:

findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
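A rough way to see why the negate-and-reverse trick works: mirroring both x and the breaks around zero turns every right-closed interval (a, b] into a left-closed one [-b, -a), which is what findInterval() handles natively. The intermediate names below (neg_breaks, k) are only for illustration.

```r
x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)

neg_breaks <- -rev(vec)            # -35 -6 -5 -2, still increasing
k <- findInterval(-x, neg_breaks)  # left-closed search in the mirrored problem
k
# [1] 3 2 1 1 1 0 0

# Counting from the other end recovers the right-closed interval index
length(vec) - k
# [1] 1 2 3 3 3 4 4
```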

Here's a small test:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
findInterval(x, vec)
# [1] 1 3 3 3 3 4 4
findInterval.rightClosed(x, vec)
# [1] 1 2 3 3 3 4 4
findInterval.rightClosed2(x, vec)
# [1] 1 2 3 3 3 4 4

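But I'd like to see any other solutions if there's something better. By "better", I mean "somehow more satisfying" or "doesn't feel like a kludge", or maybe even "more efficient". =)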

(Note that findInterval() has a rightmost.closed argument, but it means something different: it applies only to the final sub-interval.)
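A quick sketch of that difference, with illustrative values: rightmost.closed = TRUE only changes where a point equal to the last break lands, while points on interior breaks are still treated as left-closed.

```r
vec <- c(2, 5, 6, 35)
x <- c(5, 6, 35)

findInterval(x, vec)
# [1] 2 3 4
findInterval(x, vec, rightmost.closed = TRUE)
# [1] 2 3 3
```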

Ben*_*nes 10

EDIT: Major clean-up in all aisles.

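You might look at cut. By default, cut makes left-open, right-closed intervals, and that can be changed with the appropriate argument (right). To use your example: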

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
cutVec <- c(vec, max(x)) # for cut, range of vec should cover all of x
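As a small illustration of why the breaks are extended to cover all of x (the values here are picked just for the demo): the factor labels show which side of each interval is closed, and anything outside the range of the breaks comes back as NA rather than as an index.

```r
x <- c(3, 6, 7, 52)
vec <- c(2, 5, 6, 35)

# Default: left-open, right-closed; 52 is beyond the last break, so NA
as.character(cut(x, vec))
# [1] "(2,5]"  "(5,6]"  "(6,35]" NA

# right = FALSE flips to left-closed, right-open
as.character(cut(x, vec, right = FALSE))
# [1] "[2,5)"  "[6,35)" "[6,35)" NA
```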

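Now create four functions that should all do the same thing: two from the OP, one from Josh O'Brien, and one built on cut. Two arguments to cut are changed from their defaults: include.lowest = TRUE makes the smallest (leftmost) interval closed on both sides, and labels = FALSE makes cut return just the integer bin codes rather than the factor it would otherwise create.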

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
cutFun <- function(x, vec){
    cut(x, vec, include.lowest = TRUE, labels = FALSE)
}
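# The body of fiFun is a contribution by Josh O'Brien that got fed to the ether.
# It nudges each break up by a relative epsilon, so an x equal to a break falls
# into the interval below it (this relies on the finite breaks being positive).
fiFun <- function(x, vec){
    findInterval(x, vec * (1 + .Machine$double.eps))
}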
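To see what the two non-default cut arguments do, here is a minimal sketch with illustrative inputs: a point equal to the lowest break is NA by default and lands in bin 1 with include.lowest = TRUE, and labels = FALSE gives plain integer codes.

```r
vec <- c(2, 5, 6, 35)

cut(c(2, 5), vec, labels = FALSE)
# [1] NA  1
cut(c(2, 5), vec, include.lowest = TRUE, labels = FALSE)
# [1] 1 1
```

Note that this is one spot where the cut-based function is not quite equivalent to the findInterval-based ones: for x equal to vec[1], the strictly right-closed versions return 0, while cut with include.lowest = TRUE returns 1.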

Do all the functions return the same result? Yup. (Note the use of cutVec for cutFun.)

mapply(identical, list(findInterval.rightClosed(x, vec)),
  list(findInterval.rightClosed2(x, vec), cutFun(x, cutVec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

Now a more demanding data set to bin:

x <- rpois(2e6, 10)
vec <- c(-Inf, quantile(x, seq(.2, 1, .2)))

Test whether the results are identical (note the use of unname):

mapply(identical, list(unname(findInterval.rightClosed(x, vec))),
  list(findInterval.rightClosed2(x, vec), cutFun(x, vec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE
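The reason unname is needed, as far as I can tell, is that quantile() returns a named vector, and those names ride along through vec[fi] into the result of findInterval.rightClosed. A small deterministic sketch (1:10 standing in for the random data, and the function definitions repeated so the block runs on its own):

```r
findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x == vec[fi])
}
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}

x <- 1:10                                      # stand-in for the rpois() sample
vec <- c(-Inf, quantile(x, c(.2, .4, .6, .8, 1)))

a <- findInterval.rightClosed(x, vec)
b <- findInterval.rightClosed2(x, vec)

is.null(names(a))         # FALSE: names such as "20%" were carried over
is.null(names(b))         # TRUE: findInterval() itself returns a bare vector
identical(unname(a), b)   # TRUE once the names are stripped
```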

And the benchmark:

library(microbenchmark)
microbenchmark(findInterval.rightClosed(x, vec), findInterval.rightClosed2(x, vec),
  cutFun(x, vec), fiFun(x, vec), times = 50)
# Unit: milliseconds
#                                expr       min        lq    median        uq       max
# 1                    cutFun(x, vec)  35.46261  35.63435  35.81233  36.68036  53.52078
# 2                     fiFun(x, vec)  51.30158  51.69391  52.24277  53.69253  67.09433
# 3  findInterval.rightClosed(x, vec) 124.57110 133.99315 142.06567 155.68592 176.43291
# 4 findInterval.rightClosed2(x, vec)  79.81685  82.01025  86.20182  95.65368 108.51624

From this run, cut appears to be the fastest.
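For completeness, and not part of the answer above: newer versions of R (3.3.0 and later, to the best of my knowledge) give findInterval() a left.open argument that makes every interval open on the left and closed on the right. Assuming such a version is available, a sketch on the OP's small test looks like this:

```r
x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)

# left.open = TRUE: v[i[j]] < x[j] <= v[i[j] + 1]
findInterval(x, vec, left.open = TRUE)
# [1] 1 2 3 3 3 4 4
```

This was not included in the timings above, so no claim is made here about how it compares in speed.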