BACKGROUND
dplyr provides window functions. When you want to control the order in which a window function sees its rows, you can use order_by.
DATA
library(dplyr)  # needed for the pipelines below

# value is random, so your numbers will differ from the output shown
mydf <- data.frame(id = c("ana", "bob", "caroline",
                          "bob", "ana", "caroline"),
                   order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                                        "2015-01-01 19:20:00", "2015-01-01 09:07:00",
                                        "2015-01-01 08:30:00", "2015-01-01 11:11:00"),
                                      format = "%Y-%m-%d %H:%M:%S"),
                   value = runif(6, 10, 20),
                   stringsAsFactors = FALSE)
# id order value
#1 ana 2015-01-01 18:00:00 19.00659
#2 bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4 bob 2015-01-01 09:07:00 14.40996
#5 ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865
Suppose you want to use lag(). You can do the following.
arrange(mydf, id, order) %>%
  group_by(id) %>%
  mutate(check = lag(value))
# id order value check
#1 ana 2015-01-01 08:30:00 17.45165 NA
#2 ana 2015-01-01 18:00:00 19.00659 17.45165
#3 bob 2015-01-01 09:07:00 14.40996 NA
#4 bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865 NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865
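However, you can avoid arrange() by using the order_by argument of lag(). The rows stay in their original positions, and only the lagged values follow the time order.
group_by(mydf, id) %>%
  mutate(check = lag(value, order_by = order))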
# id order value check
#1 ana 2015-01-01 18:00:00 19.00659 17.45165
#2 bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4 bob 2015-01-01 09:07:00 14.40996 NA
#5 ana 2015-01-01 08:30:00 17.45165 NA
#6 caroline 2015-01-01 11:11:00 14.50865 NA
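To make the order_by argument less mysterious, here is a minimal base R sketch of the idea: sort the values by the ordering vector, apply the function, then put the results back in the original row positions. apply_in_order() and lag1() are hypothetical helpers written for illustration. They are not dplyr functions, and this is not a claim about how dplyr implements it.

```r
# Hypothetical helper (not part of dplyr): apply `fun` to `x` as if the
# rows were sorted by `order_by`, then restore the original row positions.
apply_in_order <- function(order_by, fun, x) {
  ord  <- order(order_by)                  # row positions in sorted order
  undo <- match(seq_along(order_by), ord)  # map sorted results back
  fun(x[ord])[undo]
}

# Hypothetical one-step lag in base R: shift values down, pad with NA.
lag1 <- function(v) c(NA, head(v, -1))

times  <- c(18, 9)   # the first row is the later observation
values <- c(10, 20)
lagged <- apply_in_order(times, lag1, values)
print(lagged)
```

The later row receives the earlier row's value and the earliest row gets NA, which is the same pattern as the lag(value, order_by = order) output above, without the data being rearranged.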
EXPERIMENT
I want to apply the same approach to assigning row numbers to a new column. With the sample data, you can do the following.
group_by(mydf, id) %>%
  arrange(order) %>%
  mutate(num = row_number())
# id order value num
#1 ana 2015-01-01 08:30:00 17.45165 1
#2 ana 2015-01-01 18:00:00 19.00659 2
#3 bob 2015-01-01 09:07:00 14.40996 1
#4 bob 2015-01-01 18:05:00 13.64010 2
#5 caroline 2015-01-01 11:11:00 14.50865 1
#6 caroline 2015-01-01 19:20:00 12.08506 2
Can we omit the arrange() line? After reading the CRAN manual, I tried the following. Neither attempt worked.
### Not working
group_by(mydf, id) %>%
  mutate(num = row_number(order_by = order))
### Not working
group_by(mydf, id) %>%
  mutate(num = order_by(order, row_number()))
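How can we achieve this?
I did not intend to answer this question myself. However, I decided to share what I found, because I have not seen many posts that use order_by, and especially with_order. My answer is to use with_order() instead of order_by().
group_by(mydf, id) %>%
  mutate(num = with_order(order_by = order, fun = row_number, x = order))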
# id order value num
#1 ana 2015-01-01 18:00:00 19.00659 2
#2 bob 2015-01-01 18:05:00 13.64010 2
#3 caroline 2015-01-01 19:20:00 12.08506 2
#4 bob 2015-01-01 09:07:00 14.40996 1
#5 ana 2015-01-01 08:30:00 17.45165 1
#6 caroline 2015-01-01 11:11:00 14.50865 1
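As a cross-check that does not depend on dplyr, the same per-id numbering can be sketched in base R by ranking the timestamps within each id. The vector names below are my own, and the timestamps are copied from the sample data. As far as I understand the dplyr documentation, row_number(x) ranks x with ties broken by first appearance, so mutate(num = row_number(order)) should give the same column. Treat that as an assumption I have not verified here.

```r
# Same ids and timestamps as the sample data (value is not needed here).
id <- c("ana", "bob", "caroline", "bob", "ana", "caroline")
order_time <- as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                           "2015-01-01 19:20:00", "2015-01-01 09:07:00",
                           "2015-01-01 08:30:00", "2015-01-01 11:11:00"),
                         format = "%Y-%m-%d %H:%M:%S")

# Rank the timestamps within each id; the earliest observation gets 1.
num <- ave(as.numeric(order_time), id,
           FUN = function(t) rank(t, ties.method = "first"))
print(num)
```

The result is 2 2 2 1 1 1, matching the num column produced by with_order() above while the rows keep their original positions.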
I wanted to see whether the two approaches differ in speed. In this case, they look very similar.
library(microbenchmark)
mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
                               "bob", "ana", "caroline"), times = 200000),
                    order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
                                as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
                                length.out = 1200000),
                    value = runif(1200000, 10, 20),
                    stringsAsFactors = FALSE)

# arrange() first, then number the rows within each id
jazz1 <- function() {
  group_by(mydf2, id) %>%
    arrange(order) %>%
    mutate(num = row_number())
}

# no arrange(); with_order() handles the ordering
jazz2 <- function() {
  group_by(mydf2, id) %>%
    mutate(num = with_order(order_by = order, fun = row_number, x = order))
}
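# The output shown below came from the call I originally ran:
#   microbenchmark(jazz1, jazz2, times = 1000000L)
# Without parentheses that times only the lookup of the two function objects,
# which explains the nanosecond scale and the near-identical numbers, so it
# does not compare the pipelines. To time the pipelines, call the functions
# (times = 10L is an arbitrary smaller count, since each run handles 1.2M rows):
res <- microbenchmark(jazz1(), jazz2(), times = 10L)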
res
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# jazz1 32 36 47.17647 38 47 12308 1e+06 a
# jazz2 32 36 47.02902 38 47 12402 1e+06 a