Efficiently applying functions to long data.frames in R

And*_*ton 6 r

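I have a long data frame that contains meteorological data from a mast. It contains observations (data$value) taken at the same time of different parameters (wind speed, direction, air temperature, etc., in data$param) at different heights (data$z).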

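I am trying to slice this data efficiently by $time and then apply functions to all of the data collected at each time. Usually a function is applied to a single $param at a time (i.e. I apply different functions to wind speed than to air temperature).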

Current approach

My current method is to use data.frame and ddply.

If I want to get all of the wind speed data, I run:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

Then I run my function on df using ddply():

library(plyr)  # provides ddply() and .()

df.tav <- ddply(df,
               .(time),
               function(x) {
                      y <-data.frame(V1 = sum(x$value) + sum(x$z),
                                     V2 = sum(x$value) / sum(x$z))
                      return(y)
                    })

Usually V1 and V2 are calls to other functions; these are just examples. I do need to run multiple functions on the same data.

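My current approach is slow. I have not benchmarked it, but it is slow enough that I can go and get a coffee and be back before a year's worth of data has been processed.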

I have on the order of a hundred towers to process, each with a year of data and 10-12 heights, so I am looking for something faster.

Data sample

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

edd*_*ddi 14

Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000
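As the question notes, V1 and V2 would normally be calls to your own functions. A minimal sketch of that pattern follows, assuming a small made-up table with the same columns as the question's data; `myFunction1` and `myFunction2` are hypothetical stand-ins, not functions from any package.

```r
library(data.table)

# Made-up sample with the same columns as the question's data
dt <- data.table(
  time  = rep(c(1, 2), each = 3),
  z     = c(40, 60, 80, 40, 60, 80),
  param = "wind speed",
  value = c(4.75, 4.78, 4.70, 2.18, NA, 2.30)
)
setkey(dt, param)  # sort by param so the J() lookup is fast

# Hypothetical stand-ins for the real per-time-step functions
myFunction1 <- function(value, z) sum(value) + sum(z)
myFunction2 <- function(value, z) sum(value) / sum(z)

# Look up one parameter, drop missing values, then summarise per time step
res <- dt[J("wind speed")][!is.na(value),
                           list(V1 = myFunction1(value, z),
                                V2 = myFunction2(value, z)),
                           by = time]
print(res)
```

Naming the elements of the list in `j` gives the result columns meaningful names instead of the default V1, V2.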

If you want to apply a different function to each parameter, here is a more uniform approach.

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                 p     fn
# 1: wind direction <call>    # the fn column contains quoted calls
# 2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

P.S. I think the fact that I have to use param in some way before eval for eval to work is a bug.


UPDATE: As of version 1.8.11 this bug has been fixed, and the following works:

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
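If a table of quoted calls feels too indirect, one possible alternative (not part of the answer above) is a plain named list of functions indexed by the grouping value. This is only a sketch on made-up data; `fn_list` is a hypothetical name, and it assumes `param` is a character column so that `[[` looks functions up by name.

```r
library(data.table)

# Made-up sample with the same columns as the question's data
dt <- data.table(
  time  = c(1, 1, 1, 1, 2, 2),
  z     = c(40, 40, 60, 60, 40, 40),
  param = c("wind direction", "wind speed", "wind direction",
            "wind speed", "wind direction", "wind speed"),
  value = c(250.9, 4.75, 255.8, 4.78, 253.2, 2.18)
)

# Hypothetical lookup: one function per parameter name
fn_list <- list(
  "wind direction" = function(value, z) sum(value) + sum(z),
  "wind speed"     = function(value, z) sum(value) / sum(z)
)

# Inside j, a `by` variable has length 1, so it can index the list directly
res <- dt[!is.na(value),
          list(V1 = fn_list[[param]](value, z)),
          by = list(param, time)]
print(res)
```

The trade-off is that every function must accept the same arguments, whereas quoted calls can refer to any column.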

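  • The second approach is interesting but at the limit of readability (for me). I have used the first one, with `list(V1 = myFunction1(value, z), V2 = myFunction2(value, z))`. The speed-up is roughly 100x. (2 upvotes)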

had*_*ley 9

Use dplyr. It is still in development, but it is much faster than plyr:

# devtools::install_github("hadley/dplyr")
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

Another advantage of dplyr is that you can use data tables as a backend without having to learn data.table's special syntax:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

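(dplyr with data frames is 20-100x faster than plyr, and dplyr with a data.table is about 10x faster.) dplyr is nowhere near as concise as data.table, but it has a function for each major task of data analysis, which I find makes the code easier to understand: you can almost read a sequence of dplyr operations aloud to someone else and have them understand what is going on.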
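To illustrate how such a sequence reads, here is a minimal sketch on made-up data. It uses `filter()` and the `%>%` pipe, which the code above does not, and it drops missing values explicitly before summing; the sample values are invented for illustration.

```r
library(dplyr)

# Made-up sample in the same long format as the question's data
obs <- data.frame(
  time  = c(1, 1, 1, 2, 2, 2),
  z     = c(40, 60, 60, 40, 60, 60),
  param = c("wind speed", "wind speed", "turbulence",
            "wind speed", "wind speed", "turbulence"),
  value = c(4.75, 4.78, 0.10, 2.18, NA, 0.27)
)

# Read top to bottom: keep wind speed rows with a value,
# group them by time step, then summarise each group
res <- obs %>%
  filter(param == "wind speed", !is.na(value)) %>%
  group_by(time) %>%
  summarise(V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
print(res)
```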

If you want to do a different summary for each variable, I recommend changing your data structure to be "tidy":

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy, 
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
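To make the reshaping step concrete, the sketch below casts a small made-up long table to one column per parameter and then summarises each column differently. It spells out the formula as `time + z ~ param` and passes `value.var` explicitly, both of which are assumptions about how the question's columns should be treated.

```r
library(reshape2)
library(dplyr)

# Made-up sample in the same long format as the question's data
obs <- data.frame(
  time  = c(1, 1, 1, 1, 2, 2, 2, 2),
  z     = c(40, 40, 60, 60, 40, 40, 60, 60),
  param = rep(c("wind speed", "turbulence"), 4),
  value = c(4.75, 0.10, 4.78, 0.10, 2.18, 0.27, 2.30, 0.31)
)

# One row per (time, z), one column per parameter
wide <- dcast(obs, time + z ~ param, value.var = "value")

# Each parameter is now its own column, so each can get its own summary
summ <- wide %>%
  group_by(time) %>%
  summarise(mean.speed    = mean(`wind speed`, na.rm = TRUE),
            sd.turbulence = sd(turbulence, na.rm = TRUE))
print(summ)
```

Column names containing spaces, such as `wind speed`, need backticks when referenced inside `summarise()`.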

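  • You are not being fair here. That part of @eddi's answer is complicated because he keeps the data broad and is demonstrating how far data.table can be stretched. For example, he creates a data.table containing functions and looks them up, for goodness' sake. You are not comparing like with like. By the way, don't you need to deal with the NAs in `subset`, or add `na.rm` to each `sum` call? (5 upvotes)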
  • By the way, to read a `data.table` query out loud, you say "from …". It really has only 3 simple arguments: DT[`i`, `j`, `by`]. It may click more easily for SQL users. (3 upvotes)