Vectorizing a for loop that contains a which statement and a function call

Question

Here is a reproducible example of the code I am trying to vectorize.

cutOffs <- seq(1,10,0.2)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

for(plotPoint in 1:length(cutOffs))
{
  plotOutput[plotPoint, "x"] <-
    nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])
}

plotOutput

In particular, I want to find out whether there is a way to vectorize this part:

nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])

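If I were to use the plyr library or some form of apply, there would probably not be much of a speedup, and a speedup is what I am really looking for. Fundamentally, I am trying to see whether there is a vectorization technique that I overlooked or missed while searching.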

Update:

Unit: milliseconds
  expr         min          lq        mean      median          uq         max neval
  op() 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700     1
  jr()  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088     1
  dd()  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050     1
 exp()  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331     1
 nic()  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043     1
  sg()    16.66177    16.66177    16.66177    16.66177    16.66177    16.66177     1

A more realistic approximation of what I am actually doing is this:

# generate data
numObs <- 1e5
iris <- data.frame( Sepal.Length = sample(1:numObs), Sepal.Width = sample(1:numObs) )

cutOffs <- 1:(numObs*0.01)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

followed by whichever specific method one prefers.

In general, this will be used on data sets of 50,000 to 200,000 points.

There is a big jump in speed when using

sum(Sepal.Length > cutOffs[plotPoint] & Sepal.Width > cutOffs[plotPoint])

which is a more efficient approach that I was originally missing.
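
The reason this helps is that summing a logical vector counts its TRUE values, so the count is obtained without building an index vector and a subsetted data frame first. A minimal sketch, assuming the built-in `datasets::iris` and a cutoff of 3 picked only for illustration (the names `df`, `cutoff`, `keep`, `n_subset` and `n_sum` are not from the original code):

```r
# Use the built-in data set explicitly, in case `iris` has been
# overwritten by the synthetic data generated above.
df <- datasets::iris
cutoff <- 3  # arbitrary value, for illustration only

keep <- df$Sepal.Length > cutoff & df$Sepal.Width > cutoff

# Original approach: build row indices, subset the data frame, count rows.
n_subset <- nrow(df[which(keep), ])

# Faster approach: TRUE counts as 1 and FALSE as 0 when summed.
n_sum <- sum(keep)

stopifnot(n_subset == n_sum)
```

If the columns could contain NA, `sum(keep, na.rm = TRUE)` would be needed to match `which()`, which silently drops NA values.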

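By far the best answer is sgibb's sg(). The key is to realize that only the lower of the two values in each row matters. Once that mental leap is made, there is only a single vector left to process, and vectorizing it is fairly straightforward.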

# cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)

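The equivalence behind this step can be checked directly: both values exceed a cutoff exactly when the smaller of the two does. A small sketch with made-up data (the vectors `a` and `b`, the seed and the cutoff of 7 are assumptions chosen for illustration):

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
a <- sample(1:20)  # stand-in for Sepal.Length
b <- sample(1:20)  # stand-in for Sepal.Width
cutoff <- 7        # arbitrary threshold

# Two comparisons combined with `&` ...
both_above <- a > cutoff & b > cutoff

# ... give the same answer as one comparison on the row-wise minimum.
min_above <- pmin(a, b) > cutoff

stopifnot(identical(both_above, min_above))
```

This reduces the two-column problem to counting how many elements of a single vector lie above each cutoff.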

Answer 1

sgi*_*ibb 9

I would like to add one more approach:

sg <- function() {
  # cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
  ms <- sort.int(m)
  # use `findInterval` to find all the indices 
  # (equal to "how many numbers below") lower than the threshold
  plotOutput[,"x"] <- length(ms)-findInterval(cutOffs, ms)
  plotOutput
}

This approach avoids both the for loop and the outer call, and is about 4 times faster than @nicola's approach:

microbenchmark(sg(), nic(), dd())
#Unit: microseconds
#  expr     min       lq     mean   median       uq      max neval
#  sg()  88.726 104.5805 127.3172 123.2895 144.2690  232.441   100
# nic() 474.315 526.7780 625.0021 602.3685 706.7530  997.412   100
#  dd() 669.841 736.7800 887.4873 847.7730 976.6445 2800.930   100

identical(sg(), dd())
# [1] TRUE

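To make the counting step in sg() concrete, here is a small worked example on made-up numbers (`ms_demo` and `cuts_demo` are hypothetical names used only here). `findInterval()` reports, for each cutoff, how many elements of the sorted vector are less than or equal to it, so subtracting from the length leaves the number strictly above it:

```r
ms_demo   <- c(1, 2, 2, 5, 8)  # already sorted, as sort.int() would return
cuts_demo <- c(0, 2, 4, 8, 9)  # made-up thresholds

# For each cutoff: number of elements of ms_demo that are <= the cutoff.
n_le <- findInterval(cuts_demo, ms_demo)
# 0 3 3 5 5

# Everything that is not <= the cutoff is strictly greater than it.
n_gt <- length(ms_demo) - n_le
# 5 2 2 0 0

# Brute-force check against the direct definition.
brute <- vapply(cuts_demo, function(x) sum(ms_demo > x), integer(1))
stopifnot(all(n_gt == brute))
```

The data is sorted once and each cutoff is then located in the sorted vector, instead of scanning the whole data set again for every cutoff.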

Answer 2

nic*_*ola 6

You can use outer:

plotOutput[, "x"] <- colSums(outer(
  seq_len(nrow(iris)), seq_along(cutOffs),
  function(x, y) iris$Sepal.Length[x] > cutOffs[y] & iris$Sepal.Width[x] > cutOffs[y]
))

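What outer builds here is a logical matrix with one row per observation and one column per cutoff; colSums then counts the TRUE entries per cutoff. A tiny sketch with made-up values (`len`, `wid`, `cuts`, `above` and `n_above` are hypothetical names, not from the answer):

```r
len  <- c(5, 6, 7)  # stand-ins for Sepal.Length
wid  <- c(3, 6, 4)  # stand-ins for Sepal.Width
cuts <- c(2, 4, 5)  # made-up thresholds

# outer() passes vectors of row and column indices to the function,
# which must therefore be vectorized (indexing and `>` both are).
above <- outer(seq_along(len), seq_along(cuts),
               function(i, j) len[i] > cuts[j] & wid[i] > cuts[j])

# One count per cutoff: how many observations pass both comparisons.
n_above <- colSums(above)
# 3 1 1
```

Note that the matrix has nrow(iris) * length(cutOffs) entries. For the larger setup in the question (1e5 rows and 1,000 cutoffs) that is 1e8 logical values, roughly 400 MB at 4 bytes each, so memory use may become the limiting factor.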

Answer 3

tal*_*lat 5

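This does not remove the for loop, but I think it will give you some speedup. Feel free to benchmark it and let us know how it compares on your real data: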

for(i in seq_along(cutOffs)) {
  x <- cutOffs[i]
  plotOutput[i, "x"] <- with(iris, sum(Sepal.Length > x & Sepal.Width > x))
}

Here is a small benchmark using the sample data (admittedly small, but it may give some indication):

library(microbenchmark)
microbenchmark(op(), jr(), dd(), exp(), nic())
Unit: microseconds
  expr      min        lq    median        uq       max neval
  op() 6745.428 7079.8185 7378.9330 9188.0175 11936.173   100
  jr() 1335.931 1405.2030 1466.9180 1728.6595  4692.748   100
  dd()  684.786  711.6005  758.7395  923.6670  4473.725   100
 exp() 1928.083 2066.0395 2165.6985 2392.7030  5392.475   100
 nic()  383.007  402.5495  439.3835  541.6395   851.488   100

The functions used in the benchmark are defined as follows:

op <- function(){
  for(plotPoint in 1:length(cutOffs))
  {
    plotOutput[plotPoint, "x"] <-
      nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                         iris$Sepal.Width > cutOffs[plotPoint]), ])
  }
  plotOutput
}

jr <- function() {
  # NOTE: `counts` is a helper that is not defined anywhere on this page.
  cbind(x = sapply(cutOffs, counts), y = plotOutput[,"y"])
}

dd <- function() {
  for(i in seq_along(cutOffs)) {
    x <- cutOffs[i]
    plotOutput[i, "x"] <- with(iris, sum(Sepal.Length > x & Sepal.Width > x))
  }
  plotOutput
}

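# Requires library(dplyr). Note that this definition masks base::exp(),
# and that data_frame() has since been deprecated in favour of tibble().
exp <- function() {
  data_frame(y=cutOffs) %>% 
    rowwise() %>% 
    mutate(x = sum(iris$Sepal.Length > y & iris$Sepal.Width > y))
}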

nic <- function() {
  plotOutput[, "x"] <- colSums(outer(
    seq_len(nrow(iris)), seq_along(cutOffs),
    function(x, y) iris$Sepal.Length[x] > cutOffs[y] & iris$Sepal.Width[x] > cutOffs[y]
  ))
  plotOutput  # return the filled matrix, like the other functions
}

Edit note: added @nicola's approach, which is now the fastest.
