How can I speed up or vectorize a for loop?

use*_*239 7 performance for-loop r vectorization rcpp

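I would like to speed up my for loop, either by vectorizing it, by using data.table, or by some other approach. I have to run the code on 1,000,000 rows, and my code is very slow.

The code is fairly self-explanatory. I have provided an explanation below just in case, and I have included the function's input and output. I hope you can help me make this function faster.

My goal is to bin the vector "Volume" so that each bin holds about 100 shares. The vector "Volume" contains the number of shares traded. Here is what it looks like: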

head(Volume, n = 60)
[1]  5  3  1  5  3  1  1  1  1  1  1  1 18  1  1 18  2  7 13  2  7 13  3  2  1  1  3  2  1  1  1
[32]  1  6  6  1  1  1  1  1  1  1  1 18  2  1  1  2  1 14 18  2  1  1  2  1 14  1  1  9  5
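
The vector "binIdexVector" has the same length as "Volume" and contains the bin numbers; i.e., every element belonging to the first 100 shares gets the number 1, every element belonging to the next 100 shares gets the number 2, every element belonging to the following 100 shares gets the number 3, and so on. Here is what the vector looks like: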

 head(binIdexVector, n = 60)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[48] 2 2 3 3 3 3 3 3 3 3 3 3 3

Here is my function:

#input as a vector
Volume<-c(5L, 3L, 1L, 5L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 
                   18L, 2L, 7L, 13L, 2L, 7L, 13L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 
                   1L, 1L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 2L, 1L, 
                   1L, 2L, 1L, 14L, 18L, 2L, 1L, 1L, 2L, 1L, 14L, 1L, 1L, 9L, 5L, 
                   2L, 1L, 1L, 1L, 1L, 9L, 5L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 
                   1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 2L, 9L, 9L, 3L, 3L, 1L, 1L, 
                   1L, 1L, 5L, 5L, 8L, 8L, 2L, 1L, 2L, 1L, 10L, 10L, 10L, 10L, 10L, 
                   10L, 10L, 10L, 9L, 9L, 1L, 1L, 8L, 1L, 8L, 1L, 8L, 8L, 2L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                   1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 
                   1L, 2L, 7L, 1L, 2L, 7L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
                   1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 
                   10L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 3L, 1L, 1L, 1L, 4L, 3L, 1L, 
                   1L, 1L, 4L, 25L, 1L, 1L, 25L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L)

binIdexVector <- numeric(length(Volume))

# initialize 
binIdex <-1
totalVolume <-0

for(i in seq_len(length(Volume))){

  totalVolume <- totalVolume + Volume[i]  

  if (totalVolume <= 100) {

    binIdexVector[i] <- binIdex

  } else {

    binIdex <- binIdex + 1
    binIdexVector[i] <- binIdex
    totalVolume <- Volume[i]
  }
}

# output:
> dput(binIdexVector)
c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
  2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
  4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 
  10, 10, 10, 10, 10, 10, 10, 10, 10, 10)

Thank you very much for your help!

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.2

Kha*_*haa 12

When vectorization is difficult, you can use Rcpp.

library(Rcpp)
cppFunction('
  IntegerVector bin(NumericVector Volume, int n) {
    IntegerVector binIdexVector(Volume.size());
    int binIdex = 1;
    double totalVolume =0;

    for(int i=0; i<Volume.size(); i++){
      totalVolume = totalVolume + Volume[i];
      if (totalVolume <= n) {
        binIdexVector[i] = binIdex;
      } else {
        binIdex++;
        binIdexVector[i] = binIdex;
        totalVolume = Volume[i];
      }
    }
    return binIdexVector;
  }')

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE

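It is faster than findInterval(cumsum(Volume), seq(0, sum(Volume), by=100)), which of course gives an inexact answer: its breakpoints sit at fixed multiples of 100, whereas the loop restarts the running total at the trade that overflows a bin.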
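To see where the findInterval one-liner and the loop disagree, here is a minimal sketch in base R. The function names `bin_loop` and `bin_cumsum` and the four-element test vector are made up for illustration; `bin_loop` is just the loop from the question wrapped in a function, not a new algorithm.

```r
# Hypothetical helper: the question's loop, wrapped in a function.
bin_loop <- function(Volume, n = 100) {
  binIdexVector <- integer(length(Volume))
  binIdex <- 1L
  totalVolume <- 0
  for (i in seq_along(Volume)) {
    totalVolume <- totalVolume + Volume[i]
    if (totalVolume > n) {
      # This trade overflows the bin: open a new bin and restart the total.
      binIdex <- binIdex + 1L
      totalVolume <- Volume[i]
    }
    binIdexVector[i] <- binIdex
  }
  binIdexVector
}

# Hypothetical helper: the vectorized approximation with fixed breakpoints.
bin_cumsum <- function(Volume, n = 100) {
  findInterval(cumsum(Volume), seq(0, sum(Volume), by = n))
}

v <- c(60, 60, 60, 60)   # illustrative input, not from the question
bin_loop(v)    # 1 2 3 4
bin_cumsum(v)  # 1 2 2 3
```

In this toy input every second trade overflows a bin, so the loop opens a new bin each time, while the fixed breakpoints at 0, 100, 200 group the second and third trades together. On the question's data the two may agree over long stretches and then drift apart, which is why a comparison with all.equal() against the loop output is a reasonable check before trusting any vectorized shortcut.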