R中可能使用Rcpp的大型数据文件的优化循环

use*_*424 5 performance loops r rcpp

我在R中有一个循环,它很慢(但可以)。目前,此计算在我的笔记本电脑上大约需要3分钟,我认为它可以改进。最终,我将遍历许多基于此代码结果运行计算的数据文件,并且我希望尽可能使当前代码更快。

基本上,对于每个日期,对于11个不同的X值,循环都会获取最近X年的降雨值(Y),找到线性反权重(Z),以便最古老的降雨值的加权最小,是雨的倍数(Y)和权重(Z)以获得向量A,然后将A的总和作为最终结果。这完成了数千个日期。

但是,我想不出任何办法或以任何方式寻求建议,以使其在R中更快,因此我尝试在Rcpp中重写它,但我对此知之甚少。我的Rcpp代码没有完全复制R代码,因为结果矩阵与应有的矩阵不同(错误)(out1 vs out2;我知道out1是正确的)。似乎Rcpp代码更快,但是我只能使用几列进行测试,因为如果我尝试运行所有11列(i <= 10),它就会开始崩溃(RStudio中的致命错误)。

我正在寻找有关如何改进R代码和/或更正Rcpp代码以提供正确结果而不会在过程中崩溃的反馈。

(尽管我下面发布的代码未显示出来,但数据仍以[作为数据框的形式]加载到R中,用于在所示代码之外进行的一些计算。对于此处显示的特定计算,仅列使用了数据帧中的2个。)

数据文件位于以下位置:https : //drive.google.com/file/d/0Bw_Ca37oxVmJekFBR2t4eDdKeGM/view?usp=sharing

尝试R

library(readxl)

library(readxl)
library(Rcpp)
file = data.frame(read_excel("lake.xlsx", trim_ws=T)
col_types=c("date","numeric","numeric","date",rep("numeric",4),"text")))
file[,1] = as.Date(file[,1], "%Y/%m/%d", tz="UTC")
file[,4] = as.Date(file[,4], "%Y/%m/%d", tz="UTC")

rainSUM = function(df){
rainsum = data.frame("6m"=as.numeric(), "1yr"=as.numeric(), "2yr"=as.numeric(), "3yr"=as.numeric(), "4yr"=as.numeric(), "5yr"=as.numeric(), "6yr"=as.numeric(), "7yr"=as.numeric(), "8yr"=as.numeric(), "9yr"=as.numeric(), "10yr"=as.numeric()) # create dataframe for storing the sum of weighted last d values

  Tdays <- length(df[,1])

  for(i in 1:11) {           # loop through the lags
    if (i==1) {
      d <- 183               # 6 month lag only has 183 days,
    } else {
      d <- (i-1)*366         # the rest have 366 days times the number of years
    }
    w <- 0:(d-1)/d

    for(k in 1:Tdays) {      # loop through rows of rain dataframe (k = row)
      if(d>k){               # get number of rain values needed for the lag
        rainsum[k,i] <- sum(df[1:k,2] * w[(d-k+1):d])                                
      } else{
        rainsum[k,i] <- sum(df[(k-d+1):k,2] * w)                          
      }
    }
  }
  return(rainsum)
}

out1 <- rainSUM(file)
Run Code Online (Sandbox Code Playgroud)

在Rcpp中尝试

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]

NumericVector myseq(int first, int last) {  // simulate R's X:Y sequence (step of 1)
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(i);
  return(y);
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(vec[i]);
  return(y);
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {             // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(1,d);            // sequence 1:d; length d
  NumericVector b = (a-1)/a;               // inverse linear
  return(b);                               // return vector                              
} 

// [[Rcpp::export]]              
NumericMatrix rainsumCPP(DataFrame df, int raincol) {
  NumericVector q(0);
  NumericMatrix rainsum(df.nrows(), 11);       // matrix with number of row days as data file and 11 columns 
  NumericVector p = df( raincol-1 );           // grab rain values (remember C++ first index is 0)
  for(int i = 0; i <= 10; i++) {                // loop through 11 columns (C++ index starts at 0!)
    if (i==0) {
      int d = 183;                             // 366*years lag days 
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what's available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                              // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      }
    }
    else {
      int d = i*366;                           // 183 lag days if column 0
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what's available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                             // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      } 
    }
  }
  return(rainsum);
}


/*** R
out2 = rainsumCPP(file, raincol) # raincol currently = 2
  */
Run Code Online (Sandbox Code Playgroud)

coa*_*ess 2

恭喜!您有索引越界 (OOB)错误,导致未定义的行为 (UB)!您可以在将来通过将向量访问器从 更改为 以及将矩阵访问器从 更改为 来检测[]()()一点.at()

切换到这些访问器会产生:

Error in rainsumCPP(file, 2) : 
  Index out of bounds: [index=182; extent=182].
Run Code Online (Sandbox Code Playgroud)

这表明索引超出范围,因为索引必须始终位于小于范围的 0 到 1 之间(例如向量的长度 - 1)。

初步看来,这个问题很大程度上是由于没有正确地将从一开始的索引映射到从零开始的索引造成的。

在使用myseq()splicer()weighty()函数时,它们与给定输入的R等效项匹配。这可以通过使用来检查。这种不匹配分为两部分: 1.和的边界2. 内部完成的反转all.equal(R_result, Rcpp_Result)myseqsplicerweighty

因此,通过使用以下经过修改的函数,您应该为获得正确的结果奠定了良好的基础。

// [[Rcpp::export]]
NumericVector myseq(int first, int last) {  // simulate R's X:Y sequence (step of 1)
  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = count;
    count++;
  }

  return y;
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer

  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = vec(i);

    count++;
  }

  return y;
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {       // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(0, d - 1); // (fixed) sequence 1:d; length d
  NumericVector b = a / d;           // (fixed) inverse linear
  return(b);                         // return vector                              
}
Run Code Online (Sandbox Code Playgroud)

从那里,您可能需要修改,rainsumCpp因为没有给出R等效项的输出。