Tyl*_*uth 6 performance r subset dataframe
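I have several large data frames (1 million+ rows x 6-10 columns) that I need to subset repeatedly. The subsetting step is the slowest part of my code, and I'm curious whether there is a faster way to do it.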
# load() needs a connection (not a bare URL string) to read a remote .rda file
load(url("https://dl.dropbox.com/u/4131944/Temp/DF_IOSTAT_ALL.rda"))
start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
end_in <- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
#    user  system elapsed
#   16.59    0.00   16.60
dput(head(DF_IOSTAT_ALL))
structure(list(date_stamp = structure(list(sec = c(14, 24, 34,
44, 54, 4), min = c(0L, 0L, 0L, 0L, 0L, 1L), hour = c(0L, 0L,
0L, 0L, 0L, 0L), mday = c(20L, 20L, 20L, 20L, 20L, 20L), mon = c(7L,
7L, 7L, 7L, 7L, 7L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(1L, 1L, 1L, 1L, 1L, 1L), yday = c(232L, 232L, 232L,
232L, 232L, 232L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt")), cpu = c(0.9, 0.2, 0.2, 0.1,
0.2, 0.1), rsec_s = c(0, 0, 0, 0, 0, 0), wsec_s = c(0, 3.8, 0,
0.4, 0.2, 0.2), util_pct = c(0, 0.1, 0, 0, 0, 0), node = c("bda101",
"bda101", "bda101", "bda101", "bda101", "bda101")), .Names = c("date_stamp",
"cpu", "rsec_s", "wsec_s", "util_pct", "node"), row.names = c(NA,
6L), class = "data.frame")
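I would use xts for this. The only potential hiccup is that an xts object is a matrix with an ordered index attribute, so you cannot mix types the way you can in a data.frame.
If the node column is invariant, you can simply exclude it from the xts object: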
library(xts)
x <- xts(DF_IOSTAT_ALL[,2:5], as.POSIXct(DF_IOSTAT_ALL$date_stamp))
x["2012-08-20 00:00:24/2012-08-20 00:00:54"]
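To make the range-string subsetting concrete, here is a minimal sketch on synthetic data. The object name `demo` and its values are made up for illustration (loosely mirroring the dput sample in the question), and it assumes the xts package is installed.

```r
library(xts)

# Synthetic stand-in for the OP's data: one reading every 10 seconds.
idx <- seq(as.POSIXct("2012-08-20 00:00:14", tz = "UTC"),
           by = 10, length.out = 6)
demo <- xts(cbind(cpu    = c(0.9, 0.2, 0.2, 0.1, 0.2, 0.1),
                  wsec_s = c(0, 3.8, 0, 0.4, 0.2, 0.2)),
            order.by = idx)

# The core data is a plain numeric matrix; the timestamps live in the index.
is.matrix(coredata(demo))

# "from/to" keeps every row whose timestamp falls inside the closed interval.
sub <- demo["2012-08-20 00:00:24/2012-08-20 00:00:54"]
nrow(sub)
```

Because the core data is a single matrix, a character column such as node would coerce every other column to character, which is why it has to be dropped or converted to a number first.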
UPDATE using the OP's actual data:
Data <- DF_IOSTAT_ALL
# change node from character to numeric,
# so it can exist in the xts object too.
Data$node <- as.numeric(gsub("^bda","",Data$node))
# create the xts object
x <- xts(Data[,-1], as.POSIXct(Data$date_stamp))
# subset 13:00-17:00 on a single day
system.time(x['2012-08-20 13:00/2012-08-20 17:00'])
# user system elapsed
# 0 0 0
# subset 13:00-17:00 for all days
system.time(x['T13:00/T17:00'])
# user system elapsed
# 2.64 0.00 2.66
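A likely contributor to the slow base-R subsetting in the question is that date_stamp is stored as POSIXlt (see the dput output), which R has to convert each time it is compared, whereas the answer converts it once with as.POSIXct. The following is a minimal base-R sketch of that idea on synthetic data, not the OP's file; the names `df`, `keep`, and `df_int` are hypothetical, and no timing is claimed here.

```r
# Synthetic data, not the OP's: one POSIXct timestamp every 10 seconds.
set.seed(1)
n <- 1e5
df <- data.frame(
  date_stamp = seq(as.POSIXct("2012-08-20 00:00:00", tz = "UTC"),
                   by = 10, length.out = n),
  cpu = runif(n)
)

# Keep the interval bounds as POSIXct too, so both sides compare as plain numbers.
start_in <- as.POSIXct("2012-08-20 13:00", tz = "UTC")
end_in   <- as.POSIXct("2012-08-20 17:00", tz = "UTC")

# Build the logical index once, then subset the rows.
keep   <- df$date_stamp >= start_in & df$date_stamp <= end_in
df_int <- df[keep, ]
nrow(df_int)
```

On the OP's data the equivalent one-time step would be DF_IOSTAT_ALL$date_stamp <- as.POSIXct(DF_IOSTAT_ALL$date_stamp) before any subsetting; wrapping the subset in system.time() would show whether it helps on a given machine.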