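I have to do a lot of data manipulation on large datasets (mainly with data.table, in RStudio). I would like to monitor the run time of each step without explicitly calling system.time() on every step.
Is there a package, or a simple way, to show the run time of each step by default?
Thanks.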
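This isn't exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments that record how long each top-level statement took to run.
That is, time_file() turns this: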
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
into this:
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#: user system elapsed
#: 0.451 0.003 0.453
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#: user system elapsed
#: 0.029 0.000 0.029
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#: user system elapsed
#: 0.008 0.000 0.008
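It doesn't time code inside a top-level { block, so you can choose not to time things you're not interested in.
I don't think there is any way to add timing automatically as a top-level effect without somehow changing the way you run the code, that is, by using something like time_file instead of source.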
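To illustrate the general idea of running a script through a timing wrapper instead of source(), here is a minimal sketch. It is not the code from the gist: the function name time_top_level and its return format are made up for this example, and unlike time_file it returns a data frame rather than rewriting the file. It parses the file, evaluates each top-level expression in turn, and skips timing for anything wrapped in a top-level { block.

```r
# Hypothetical helper, NOT the gist's time_file(): evaluates each top-level
# expression of an R script and records its elapsed time.
time_top_level <- function(path, envir = new.env(parent = globalenv())) {
  exprs <- parse(file = path)
  labels <- character(length(exprs))
  elapsed <- rep(NA_real_, length(exprs))

  for (i in seq_along(exprs)) {
    e <- exprs[[i]]
    labels[i] <- deparse(e)[1]  # first line of the statement, as a label

    if (is.call(e) && identical(e[[1]], as.name("{"))) {
      # Top-level { block: run it, but leave its timing as NA
      eval(e, envir)
    } else {
      # gcFirst = FALSE avoids a garbage collection before every statement
      elapsed[i] <- system.time(eval(e, envir), gcFirst = FALSE)[["elapsed"]]
    }
  }

  data.frame(statement = labels, elapsed = elapsed, stringsAsFactors = FALSE)
}

# Usage: write a small script to a temporary file and time it
script <- tempfile(fileext = ".R")
writeLines(c(
  "{",
  "  x <- runif(1e5)",
  "}",
  "y <- sort(x)",
  "z <- cumsum(y)"
), script)

res <- time_top_level(script)
print(res)
```

Evaluating in a fresh environment keeps the script's variables out of your workspace; pass envir = globalenv() if you want them to persist, as they would with source().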
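You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)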
library(microbenchmark)
microbenchmark(
  runif(1e4),
  system.time(runif(1e4)),
  # gcFirst spelled out in full rather than relying on partial matching of `gc`
  system.time(runif(1e4), gcFirst = FALSE)
)
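So timing itself adds relatively little overhead (about 20 μs on my computer), but the garbage collection that system.time() runs first by default (gcFirst = TRUE) adds about 27 ms per call. So unless you have thousands of top-level calls, you're unlikely to see much impact.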
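As a rough back-of-the-envelope check, the per-call figures quoted above can be scaled by the number of top-level statements. The 20 μs and 27 ms values are the ones from this answer and will differ on other machines; the variable names are just for illustration.

```r
# Per-call costs taken from the figures quoted above (machine-dependent)
per_call_timing <- 20e-6  # seconds added by system.time() itself
per_call_gc     <- 27e-3  # seconds added by the default garbage collection

n_calls <- c(10, 100, 1000, 10000)

overhead <- data.frame(
  n_calls       = n_calls,
  timing_only_s = n_calls * per_call_timing,
  with_gc_s     = n_calls * (per_call_timing + per_call_gc)
)
print(overhead)
```

With these numbers the garbage collection dominates the total, so it only becomes noticeable once a script has on the order of a thousand or more top-level statements.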