分类整数数据的最快方法

Bay*_*rSe 6 r

我在下表中给出了一个映射:

Input  Output
<4     0
5      0.4
6      0.5
7      0.65
8      0.75
9      0.85
>=10   1
Run Code Online (Sandbox Code Playgroud)

到现在为止,我写了3个版本:

k1 <- function(h) {
  if (h <= 4) { k <- 0
  } else if (h == 5) { k <- 0.4
  } else if (h == 6) { k <- 0.5
  } else if (h == 7) { k <- 0.65
  } else if (h == 8) { k <- 0.75
  } else if (h == 9) { k <- 0.85
  } else if (h >= 10) { k <- 1}
  return(k)
}
Run Code Online (Sandbox Code Playgroud)

第二:

k2 <- function(h) {
  k <- 0
  k[h == 5] <- 0.4
  k[h == 6] <- 0.5
  k[h == 7] <- 0.65
  k[h == 8] <- 0.75
  k[h == 9] <- 0.85
  k[h >= 10] <- 1.0
  return(k)
}
Run Code Online (Sandbox Code Playgroud)

第三:

k3 <- function(h) {
  k <- cut(h, breaks=c(0, 5, 6, 7, 8, 9, Inf), labels=c(0, 0.5, 0.65, 0.75, 0.85, 1), right=FALSE)  
  return(k)
}
Run Code Online (Sandbox Code Playgroud)

我需要在两种不同的场景中使用该功能.首先,评估标量输入,然后评估值的向量.

对于标量输入:

h <- 5
microbenchmark(k1(h), k2(h), k3(h))

Unit: microseconds
  expr    min      lq     mean  median     uq     max neval
 k1(h)  1.208  1.5110  2.38264  1.8125  2.114  15.698   100
 k2(h)  4.529  5.5855  8.71286  6.3400  7.849  73.053   100
 k3(h) 52.224 54.0360 71.74953 68.9785 79.393 304.286   100
Run Code Online (Sandbox Code Playgroud)

对于矢量输入:

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h))
Unit: microseconds
          expr     min      lq     mean   median       uq     max neval
 sapply(h, k1) 595.592 617.327 641.8598 637.8535 654.9100 857.918   100
         k2(h)  15.397  17.207  19.5470  18.1130  19.6225  49.508   100
         k3(h) 110.486 116.070 131.3117 121.2020 140.6720 275.910   100
Run Code Online (Sandbox Code Playgroud)

因此,k1对于标量输入是最快的,对于向量输入是k2.

你认为有可能提高速度吗?我无法相信这样一个笨拙的if/else代码在标量情况下应该是最快的.而且,我想要一个统一的功能,而不是两个单独的功能.

Dav*_*son 8

首先,为什么要在标量输入上优化几微秒?如果答案是"因为标量版本必须多次调用",或许是在一个循环中,而不是那个问题; 操作应该是矢量化的.(请注意,您k2可以在k1处理15个时间内处理250个输入).

无论如何,另一种选择是:

outputs <- c(0, .4, .5, .65, .75, .85, 1)
k4 <- function(h) {
  output[pmin.int(pmax.int(h, 4), 10) - 3]
}
Run Code Online (Sandbox Code Playgroud)

在我的系统中,这只是k2在矢量化情况下的关系,但它的速度大约k2是标量情况的两倍:

h <- 5
microbenchmark(k1(h), k2(h), k3(h), k4(h))
# Unit: nanoseconds
#   expr   min      lq     mean  median      uq    max neval
#  k1(h)   748   933.5  1314.29  1181.5  1655.0   3091   100
#  k2(h)  4131  5424.5  6378.31  6236.5  7021.5  18140   100
#  k3(h) 72149 74495.0 79796.22 75716.0 80936.5 176857   100
#  k4(h)  1730  2259.5  3396.04  3338.5  3801.0  17001   100

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h), k4(h))
# Unit: microseconds
#           expr     min       lq      mean   median       uq     max neval
#  sapply(h, k1) 311.099 327.5710 341.05200 335.9330 348.6405 405.830   100
#          k2(h)  13.973  18.4965  20.64351  20.4160  22.4015  34.289   100
#          k3(h) 117.401 125.0180 134.49228 129.2455 138.8240 241.896   100
#          k4(h)  15.042  17.8870  20.33141  19.0690  20.4260  37.386   100
Run Code Online (Sandbox Code Playgroud)

它也更简洁k2,更容易扩展到更多的整数输入.

最后,如果你愿意依赖Rcpp,你可以相对于k2和获得5倍的加速k4:

library(Rcpp)
cppFunction('NumericVector k5(IntegerVector h) {
              int n = h.size();
              NumericVector out(n);

              for (int i = 0; i < n; ++i) {
                int val = h[i];
                if (val <= 4) out[i] = 0;
                else if (val == 5) out[i] = .4;
                else if (val == 6) out[i] = .5;
                else if (val == 7) out[i] = .65;
                else if (val == 8) out[i] = .75;
                else if (val == 9) out[i] = .85;
                else if (val >= 10) out[i] = 1;
              }
              return out;
            }')

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h), k4(h), k5(h))

# Unit: microseconds
#           expr     min       lq      mean   median       uq     max neval
#  sapply(h, k1) 382.383 410.7310 429.88844 423.7150 442.5765 501.400   100
#          k2(h)  17.129  20.5865  23.95221  22.1340  23.7915  46.827   100
#          k3(h) 123.519 127.6830 142.24084 140.5400 150.1525 218.919   100
#          k4(h)  15.168  18.2705  20.45797  19.1985  20.6105  52.650   100
#          k5(h)   2.988   4.9045   6.49218   5.9135   6.8455  33.219   100
Run Code Online (Sandbox Code Playgroud)

(请参阅此Rcpp介绍的"矢量输入,矢量输出"部分,作为此类功能的指南).但请注意,它仍然比k1标量情况慢2倍!


Kha*_*haa 7

findInterval 是R中最快的功能

out <- c(0, .4, .5, .65, .75, .85, 1)
k6 <- function(h){
    ind <- findInterval(h, c(4, 5, 6, 7, 8, 9) +0.1) + 1
    out[ind]
}

h <- rep(5, 250)
microbenchmark(k2(h), k4(h), k6(h), unit="relative")
Unit: relative
#  expr      min       lq     mean   median       uq      max neval
# k2(h) 2.283983 2.347714 2.225037 2.392578 2.319125 1.184224   100
# k4(h) 1.830939 1.725286 1.699866 1.701196 1.599973 1.414026   100
# k6(h) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
Run Code Online (Sandbox Code Playgroud)