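Question
Using dplyr, how do I select the top and bottom observations/rows of grouped data in a single statement?
Data and example
Given a data frame: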
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))
I can get the top and bottom observations of each group using slice, but only with two separate statements:
firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup

lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup
Can I combine these two statements into one that selects both the top and bottom observations?
jer*_*ycg 204
There is probably a faster way:
df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())
Fra*_*ank 95
Just for completeness: you can pass slice a vector of indices:
df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
which gives
  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
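One case worth checking with the index-vector form is a group that has only one row, since 1 and n() are then the same index. The following is a minimal sketch, assuming a reasonably recent dplyr; the extra id = 4 row is made up for illustration and is not part of the question's data.

```r
library(dplyr)

# The question's data plus one hypothetical single-row group (id = 4)
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
                  stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "a"),
                  stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2, 1))

# slice(c(1, n())) requests index 1 twice when n() == 1,
# so the single row of id = 4 is returned twice
sliced <- df1 %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()

# unique() collapses the repeated index before slicing
sliced_unique <- df1 %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()

# filter() tests each row once, so a matching row is kept only once
filtered <- df1 %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  ungroup()
```

Whether the repeated row is wanted depends on the use case; if every group is known to have at least two rows, the variants behave the same.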
Mic*_*ico 15
Not dplyr, but it is much more direct with data.table:
library(data.table)
setDT(df)
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3
A more detailed explanation:
# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; since this operation produces
#      a data.table, the row numbers are extracted from it as a
#      column afterwards
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable V1
#      as row_num to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num
# 2) extract rows by number
df[idx]
Be sure to check out the Getting Started wiki for the basics of data.table.
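To see the two special symbols in isolation, here is a small sketch that sorts the table in place first and only then groups, so that the row numbers in .I plainly refer to the sorted table. The names dt, idx and first_last are illustrative only, and this variant reorders the table itself, which the answer above does not do.

```r
library(data.table)

dt <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Sort by reference, so row numbers below refer to the sorted table
setorder(dt, id, stopSequence)

# Within each id, .I holds that group's row numbers and .N its size,
# so .I[c(1L, .N)] is the first and last row number of the group
idx <- dt[, .I[c(1L, .N)], by = id]$V1

# Extract those rows by number
first_last <- dt[idx]
```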
Something like:
library(dplyr)

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup

## Source: local data frame [6 x 3]
##
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3
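With do you can perform pretty much any number of operations on each group, but @jeremycg's answer is far more appropriate for this particular task.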
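As far as I know, do() has been marked as superseded since dplyr 1.0.0, and group_modify() is one of the suggested replacements. The sketch below applies the same first_last() helper through group_modify(); it assumes a dplyr version that provides group_modify(), and stringsAsFactors = FALSE is added here only to keep stopId as character regardless of R version.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Same helper as above: first and last row of whatever it is given
first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

# group_modify() calls the function once per group; .x is that group's
# rows without the grouping column, which is added back to the result
res <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  group_modify(~ first_last(.x)) %>%
  ungroup()
```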
Using which.min and which.max:
library(dplyr, warn.conflicts = FALSE)

df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence)))
#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3
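Benchmark
It is also much faster than the currently accepted answer, because we find the minimum and maximum within each group instead of sorting the whole stopSequence column.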
# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, simplify = FALSE))

bench::mark(
  mm = df2 %>%
    group_by(id) %>%
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0
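Speed aside, it may be worth confirming that the two approaches select the same rows. The sketch below is a small sanity check on the question's data, not a proof: which.min() and which.max() return the first position of the minimum and maximum, so with tied stopSequence values inside a group the selected rows could differ from the sort-based result. The names by_extremes, by_sorting and same are illustrative.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
                 stringsAsFactors = FALSE)

# No sorting: locate the extremes within each group
by_extremes <- df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence))) %>%
  ungroup()

# Sort the whole column, then take the first and last row per group
by_sorting <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()

# Compare the two results as plain data frames
same <- isTRUE(all.equal(as.data.frame(by_extremes),
                         as.data.frame(by_sorting)))
```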
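I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:
Base package: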
df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ],
      df[!duplicated(df$id, fromLast = TRUE), ],
      all = TRUE)
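The base approach relies on duplicated() flagging every occurrence of an id after the first one (or, with fromLast = TRUE, every occurrence before the last one). As a sketch of a variation rather than the answer's own method, the two flags can also be combined into a single logical index, which avoids the merge; keep and res are illustrative names.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Sort so that the first/last row of each id is its min/max stopSequence
df <- df[order(df$id, df$stopSequence), ]

# TRUE for the first occurrence of each id, or for the last occurrence
keep <- !duplicated(df$id) | !duplicated(df$id, fromLast = TRUE)

# A row that is both first and last (single-row group) is kept once
res <- df[keep, ]
```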
data.table:
df <- setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]
sqldf:
library(sqldf)
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
              FROM df GROUP BY id
              ORDER BY id, StopSequence, stopId")
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
              FROM df GROUP BY id
              ORDER BY id, StopSequence, stopId")
sqldf("SELECT * FROM min
       UNION
       SELECT * FROM max")
In one query:
sqldf("SELECT *
       FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
             FROM df GROUP BY id
             ORDER BY id, StopSequence, stopId)
       UNION
       SELECT *
       FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
             FROM df GROUP BY id
             ORDER BY id, StopSequence, stopId)")
Output:
  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1
小智 6
This works fine:
df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1,n())

# A tibble: 6 × 3
# Groups:   id [3]
#     id stopId stopSequence
#  <dbl> <chr>         <dbl>
#1     1 a                 1
#2     1 c                 3
#3     2 b                 1
#4     2 c                 4
#5     3 b                 1
#6     3 a                 3