Return system.time by default

Ada*_*NYC 5 r

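I have to do a lot of data manipulation on a large dataset (mostly with data.table, in RStudio). I would like to monitor the run time of each step without explicitly calling system.time() on every step.

Is there a package, or a simple way, to show the run time of each step by default?

Thanks.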

had*_*ley 5

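This isn't exactly what you asked for, but I wrote time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments that record how long each top-level statement took to run.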

time_file() turns this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")

into this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#:    user  system elapsed
#:   0.451   0.003   0.453

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#:    user  system elapsed
#:   0.029   0.000   0.029

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#:    user  system elapsed
#:   0.008   0.000   0.008

It doesn't time code inside a top-level { block, so you can wrap anything you're not interested in timing in one.

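I don't think there is any way to add timing automatically as a top-level effect without somehow changing the way you run your code, that is, using something like time_file instead of source.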
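To illustrate the idea of swapping source for a timing variant, here is a minimal sketch. It is not the time_file() from the gist: time_source is a hypothetical name, and instead of rewriting the file it just prints the timing of each top-level expression. It does copy the one behaviour described above, namely that top-level { blocks are run but not timed.

```r
# Hypothetical helper, NOT the gist's time_file(): evaluates each top-level
# expression of an R file and prints its timing instead of rewriting the file.
time_source <- function(path, envir = globalenv()) {
  exprs <- parse(file = path, keep.source = TRUE)
  timings <- vector("list", length(exprs))
  for (i in seq_along(exprs)) {
    expr <- exprs[[i]]
    # Top-level { blocks are evaluated but deliberately left untimed.
    if (is.call(expr) && identical(expr[[1]], as.name("{"))) {
      eval(expr, envir)
      next
    }
    timings[[i]] <- system.time(eval(expr, envir))
    cat(deparse(expr)[1], "\n")
    print(timings[[i]])
  }
  invisible(timings)
}

# Usage on a throwaway script: the { block is run untimed,
# the two statements after it are timed.
script <- tempfile(fileext = ".R")
writeLines(c("{", "  n <- 1e5", "}", "x <- runif(n)", "y <- sort(x)"), script)
res <- time_source(script, envir = new.env())
```

The timings are returned invisibly as a list with one slot per top-level expression (NULL for the untimed blocks), so you could collect them into a table rather than reading the printed output.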

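You might wonder how much timing every top-level operation affects the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)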

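library(microbenchmark)
# Compare the bare call with the same call wrapped in system.time(),
# with and without the garbage collection that system.time() runs first.
microbenchmark(
  runif(1e4),
  system.time(runif(1e4)),
  system.time(runif(1e4), gcFirst = FALSE)
)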

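So the timing itself adds relatively little overhead (20 μs on my computer), but the garbage collection that system.time() runs by default adds about 27 ms per call. Unless you have thousands of top-level calls, you're unlikely to see much impact.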
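If you want a rough check of those numbers on your own machine without installing microbenchmark, something like the following base R sketch might do. per_call is a hypothetical helper introduced here for illustration, and the figures it reports will differ from the ones quoted above.

```r
# Hypothetical helper: average elapsed seconds per call of `f` over `n` calls.
per_call <- function(f, n = 50) {
  start <- proc.time()[["elapsed"]]
  for (i in seq_len(n)) f()
  (proc.time()[["elapsed"]] - start) / n
}

with_gc    <- per_call(function() system.time(runif(1e4)))
without_gc <- per_call(function() system.time(runif(1e4), gcFirst = FALSE))

# The difference approximates the cost of the garbage collection that
# system.time() triggers by default before it starts timing.
cat("with gc:   ", with_gc, "s per call\n")
cat("without gc:", without_gc, "s per call\n")
```

Multiplying the per-call difference by the number of top-level statements in a script gives a ballpark for the total overhead of timing everything.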