Select first and last row from grouped data

tos*_*pig 120 r dplyr

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data & Example

Given a data frame

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
                 stopId=c("a","b","c","a","b","c","a","b","c"), 
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

I can get the top and bottom observations from each group using slice, but only with two separate statements:

firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup

lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup

Can I combine these two statements into one that selects both the top and bottom observations?

jer*_*ycg 204

There is probably a faster way:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())

  • `rownumber() %in% c(1, n())` would avoid the need to run the vector scan twice (61 upvotes)
  • @MichaelChirico I suspect you omitted an `_`? i.e. `filter(row_number() %in% c(1, n()))` (12 upvotes)
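
The variant suggested in these comments can be sketched as follows. It assumes the `df` from the question; the name `first_last_in` is made up for illustration and does not appear in the thread.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Keep a row if its within-group position is 1 or n(),
# using a single %in% test instead of two == comparisons
first_last_in <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  filter(row_number() %in% c(1, n())) %>%
  ungroup()

first_last_in
```

For a group with a single row, position 1 and position `n()` coincide, so the filter keeps that row once.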

Fra*_*ank 95

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

which gives

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
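
One thing to watch with an index vector: when a group has only one row, `c(1, n())` evaluates to `c(1, 1)`, so the same row can come back twice. The sketch below uses a made-up data frame, `one_stop` (not part of the question), and wraps the indices in `unique()` as one possible way to avoid that.

```r
library(dplyr)

# Hypothetical data: id 2 has a single row
one_stop <- data.frame(id = c(1, 1, 2),
                       stopId = c("a", "b", "a"),
                       stopSequence = c(1, 2, 1))

# unique() collapses c(1, 1) to 1 for single-row groups
first_last_unique <- one_stop %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()

first_last_unique
```

Whether the duplicate matters depends on what the result is used for; if every group is known to have at least two rows, the plain `slice(c(1, n()))` is enough.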


Mic*_*ico 15

Not dplyr, but it's much more direct using data.table:

library(data.table)
setDT(df)
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3

A more detailed explanation:

# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; since this operation produces
#      a data.table, the row numbers are pulled out of its
#      column in the line after
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable
#      row_num (instead of the default V1) to give other readers
#      of my code a clearer understanding of what operation is
#      producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num

# 2) extract rows by number
df[idx]

Be sure to check out the Getting Started wiki for the data.table basics.

  • @ArtemKlevtsov - you may not always want to set keys, though. (3 upvotes)
  • Or `df[order(stopSequence), .SD[c(1L, .N)], by = id]`. See [here](/sf/answers/990906521/) (2 upvotes)

hrb*_*str 7

Something like:

library(dplyr)

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup

## Source: local data frame [6 x 3]
## 
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3

With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is much better suited to just this task.

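  • No disagreement (I pointed out that jeremycg's is a better answer _in_ the post), but having a `do` example here might help others when `slice` won't work (e.g. more complex operations on a group). And you should post your comment as an answer (it's the best one). (4 upvotes)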
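
In dplyr 1.0.0 and later, `do()` is marked as superseded. A rough equivalent of the helper-function approach can be sketched with `group_modify()`, assuming a dplyr version that provides it. The helper name `first_last_rows` is made up for this example, and `head()`/`tail()` stand in for the two `slice()` calls.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Hypothetical helper: stack the first and last row of one group
first_last_rows <- function(x) {
  bind_rows(head(x, 1), tail(x, 1))
}

res <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  group_modify(~ first_last_rows(.x)) %>%  # .x is the data for one group
  ungroup()

res
```

As with `do()`, the function applied to each group can be arbitrarily complex; it only has to return a data frame.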

Moo*_*per 7

Using which.min and which.max:

library(dplyr, warn.conflicts = FALSE)
df %>% 
  group_by(id) %>% 
  slice(c(which.min(stopSequence), which.max(stopSequence)))

#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3

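Benchmark

It is also much faster than the currently accepted answer, because we find the minimum and maximum value by group instead of sorting the whole stopSequence column.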

# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, simplify = FALSE))
bench::mark(
  mm =df2 %>% 
    group_by(id) %>% 
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0
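
As a sanity check on the small sample data, the sketch below compares the rows picked by the `which.min`/`which.max` approach with those picked by sorting first. The names `by_minmax`, `by_sort` and `same_rows` are made up for illustration. Both results are sorted the same way before comparing so that row order does not affect the check.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Per-group min/max positions, no sorting of the full column
by_minmax <- df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence))) %>%
  ungroup() %>%
  arrange(id, stopSequence)

# Sort first, then take positions 1 and n() in each group
by_sort <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup() %>%
  arrange(id, stopSequence)

same_rows <- isTRUE(all.equal(as.data.frame(by_minmax),
                              as.data.frame(by_sort)))
same_rows
```

When stopSequence has ties within a group, the two approaches may pick different rows, since `which.min` and `which.max` return the first matching position in the original order.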


mpa*_*nco 6

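I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

Base package: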

df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ], 
      df[!duplicated(df$id, fromLast = TRUE), ], 
      all = TRUE)

data.table:

df <-  setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

sqldf:

library(sqldf)
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
sqldf("SELECT * FROM min
      UNION
      SELECT * FROM max")

In one query:

sqldf("SELECT * 
        FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)
        UNION
        SELECT *
        FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)")

Output:

  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1


小智 6

This works fine:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1,n())

# A tibble: 6 × 3
# Groups:   id [3]
#     id stopId stopSequence
#  <dbl> <chr>         <dbl>
#1     1 a                 1
#2     1 c                 3
#3     2 b                 1
#4     2 c                 4
#5     3 b                 1
#6     3 a                 3