I have a named character vector returned from xmlAttrs, like this:
testVect <- structure(c("11.2.0.3.0", "12.89", "12.71"), .Names = c("db_version",
"elapsed_time", "cpu_time"))
I would like to convert it into a data frame like this:
testDF <- data.frame("db_version"="11.2.0.3.0","elapsed_time"=12.89,"cpu_time"=12.71)
head(testDF)
db_version elapsed_time cpu_time
1 11.2.0.3.0 12.89 12.71
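One possible approach, sketched in base R: transpose the named vector into a one-row matrix, turn that into a data frame, and let `type.convert()` decide which columns are numeric. This assumes every attribute should become exactly one column and that values which fail to parse as numbers (such as the version string) should stay character.

```r
# Named character vector, as returned by xmlAttrs in the question
testVect <- c(db_version = "11.2.0.3.0",
              elapsed_time = "12.89",
              cpu_time = "12.71")

# t() yields a 1 x 3 character matrix whose column names are the vector names
testDF <- as.data.frame(t(testVect), stringsAsFactors = FALSE)

# Convert each column to its natural type; as.is = TRUE keeps strings as character
testDF[] <- lapply(testDF, type.convert, as.is = TRUE)

str(testDF)
```

The `testDF[] <-` form keeps the data frame structure while replacing each column with its converted version.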

I have several large data frames (1 million+ rows by 6-10 columns) that I need to subset repeatedly. Subsetting is the slowest part of my code, and I am curious whether there is a faster way to do it.
# load() needs a file path or a connection, so a remote file is wrapped in url()
load(url("https://dl.dropbox.com/u/4131944/Temp/DF_IOSTAT_ALL.rda"))
start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
end_in <- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
> system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
user system elapsed
16.59 0.00 16.60
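The dput output for this data shows date_stamp stored as a list of sec, min, hour and so on, which is the POSIXlt representation (strptime also returns POSIXlt). Comparisons on POSIXlt have to convert to POSIXct each time, so one likely speed-up is to convert the column once and compare against POSIXct bounds. The sketch below uses a small synthetic data frame (`DF`, with hypothetical values and a UTC time zone) rather than the original file.

```r
# Synthetic stand-in for DF_IOSTAT_ALL: one timestamp every 10 minutes
stamps_lt <- as.POSIXlt(
  as.POSIXct("2012-08-20 00:00:00", tz = "UTC") + seq(0, by = 600, length.out = 200)
)
DF <- data.frame(value = seq_along(stamps_lt))

# Convert POSIXlt to POSIXct once, up front, instead of on every comparison
DF$date_stamp <- as.POSIXct(stamps_lt)

# Build the interval bounds as POSIXct as well
start_in <- as.POSIXct("2012-08-20 13:00", tz = "UTC")
end_in   <- as.POSIXct("2012-08-20 17:00", tz = "UTC")

# Same subsetting expression as in the question, now on plain numeric-backed times
DF_INT <- DF[DF$date_stamp >= start_in & DF$date_stamp <= end_in, ]
```

Whether this removes most of the 16 seconds depends on the real data; wrapping both versions in system.time() would confirm it.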
dput(head(DF_IOSTAT_ALL))
structure(list(date_stamp = structure(list(sec = c(14, 24, 34,
44, 54, 4), min = c(0L, 0L, 0L, 0L, 0L, 1L), hour = c(0L, 0L,
0L, 0L, 0L, 0L), mday = c(20L, 20L, 20L, 20L, 20L, 20L), mon = c(7L,
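7L, 7L, 7L, …

I am trying to write a function that takes a data frame and the name of a column to summarize with dplyr, and then returns the summarized data frame. I have tried a number of variations on interp() from the lazyeval package, but I spent too much time trying to get them to work. So I wrote a "static" version of the function I want: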
summarize.df.static <- function(){
temp_df <- mtcars %>%
group_by(cyl) %>%
summarize(qsec = mean(qsec),
mpg=mean(mpg))
return(temp_df)
}
new_df <- summarize.df.static()
head(new_df)
Here is the start of the dynamic version, which is where I am stuck:
summarize.df.dynamic <- function(df_in,sum_metric_in){
temp_df <- df_in %>%
group_by(cyl) %>%
summarize_(qsec = mean(qsec),
sum_metric_in=mean(sum_metric_in)) # some mix of interp()
return(temp_df)
}
new_df <- summarize.df.dynamic(mtcars,"mpg")
head(new_df)
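Note that I want the output column name in this example to come from the argument that is passed in as well (mpg in this case). Also note that the qsec column is static, i.e. it is not passed in.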
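For readers on a current dplyr, where summarize_() and lazyeval are deprecated, the same requirement might be sketched with tidy evaluation instead. This assumes dplyr 1.0 or later; the function name `summarize_df_dynamic` is made up for the illustration and is not from the original post.

```r
library(dplyr)

# Hypothetical helper: df_in is a data frame, sum_metric_in a column name as a string
summarize_df_dynamic <- function(df_in, sum_metric_in) {
  df_in %>%
    group_by(cyl) %>%
    summarize(
      qsec = mean(qsec),                                   # static column
      !!sum_metric_in := mean(.data[[sum_metric_in]])      # column chosen by name
    )
}

new_df <- summarize_df_dynamic(mtcars, "mpg")
head(new_df)
```

`.data[[...]]` looks the column up by its string name, and `!!name :=` sets the output column name from the same string, so no renaming step is needed afterwards.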
Here is the correct answer, posted by "docendo discimus":
summarize.df.dynamic<- function(df_in, sum_metric_in){
temp_df <- df_in %>%
group_by(cyl) %>%
summarize_(qsec = ~mean(qsec),
xyz = interp(~mean(var), var = as.name(sum_metric_in)))
names(temp_df)[names(temp_df) == "xyz"] <- sum_metric_in
return(temp_df)
}
new_df <- summarize.df.dynamic(mtcars,"mpg")
head(new_df)
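# cyl qsec …

I have a data frame with missing values for "SNAP_ID". I would like to fill in the missing values with floating-point values that continue in sequence from the previous non-missing value (lag()?). If possible, I would really like to do this with dplyr.
Assumptions:
Current data: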
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
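A minimal dplyr sketch of one way to get from the current data to this target: number each run that starts at a non-missing SNAP_ID, then add 0.1 per row within the run. The data frame below is retyped from the printed table, with UTC assumed for the timestamps; `grp` is a helper column introduced only for the illustration.

```r
library(dplyr)

df <- data.frame(
  end = as.POSIXct(c("2015-06-26 12:59:00", "2015-06-26 13:59:00",
                     "2015-06-27 00:01:00", "2015-06-27 23:00:00",
                     "2015-06-28 00:01:00", "2015-06-28 23:00:00",
                     "2015-06-29 09:00:00", "2015-06-29 09:59:00"), tz = "UTC"),
  SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)
)

filled <- df %>%
  # A new group starts at every non-missing SNAP_ID
  mutate(grp = cumsum(!is.na(SNAP_ID))) %>%
  group_by(grp) %>%
  # The first row of a group holds the known value; later rows step by 0.1
  mutate(SNAP_ID = first(SNAP_ID) + (row_number() - 1) * 0.1) %>%
  ungroup() %>%
  select(-grp)
```

One limitation of this scheme: a run of ten or more missing rows would reach the next whole number, so a smaller step would be needed for longer gaps.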
As a data frame:
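df <- structure(list(end = structure(c(1435323540, …

I have data that is collected from a database at regular intervals. The metrics are counters that only ever increase. To get the value of a metric for a given interval, you have to subtract the previous snapshot of a row from the current snapshot of the same row.
Example: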
TS INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
2014-01-29 17:20:36 1 log file sync 1 756873
2014-01-29 17:20:36 1 log file sync 2 15627
2014-01-29 17:20:36 1 log file sync 4 2925
2014-01-29 17:21:03 1 log file sync 1 761063
2014-01-29 17:21:03 1 log file sync 2 15659
2014-01-29 17:21:03 1 log file sync 4 2929
Desired output:
TS INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
2014-01-29 17:21:03 1 log file sync 1 4190
2014-01-29 17:21:03 1 log file sync 2 32
2014-01-29 17:21:03 1 …
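The subtraction described above can be sketched with dplyr by grouping on the columns that identify a counter and taking the difference from the previous snapshot with lag(). The data frame is retyped from the example rows; treating INST_ID, EVENT and WAIT_TIME_MILLI together as the row identity, and UTC as the time zone, are assumptions.

```r
library(dplyr)

waits <- data.frame(
  TS = as.POSIXct(rep(c("2014-01-29 17:20:36", "2014-01-29 17:21:03"), each = 3),
                  tz = "UTC"),
  INST_ID = 1L,
  EVENT = "log file sync",
  WAIT_TIME_MILLI = rep(c(1, 2, 4), times = 2),
  WAIT_COUNT = c(756873, 15627, 2925, 761063, 15659, 2929)
)

deltas <- waits %>%
  # One group per counter (assumed identity columns)
  group_by(INST_ID, EVENT, WAIT_TIME_MILLI) %>%
  arrange(TS, .by_group = TRUE) %>%
  # Current snapshot minus the previous snapshot of the same counter
  mutate(WAIT_COUNT = WAIT_COUNT - lag(WAIT_COUNT)) %>%
  ungroup() %>%
  # The first snapshot of each counter has nothing to subtract from
  filter(!is.na(WAIT_COUNT))

deltas
```

If a counter can reset (for example after an instance restart), negative differences would appear and need separate handling.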