How to use dplyr's summarize and which() to find min/max values

dre*_*ww2 18 r dplyr

I have the following data:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

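I want to use dplyr to:

(1) group the data by "Group"
(2) show the minimum and maximum age within each group
(3) show the names of the people with the minimum and maximum age

The following code does this: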

data %>% group_by(Group) %>%
     summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
               maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])

This works well:

  Group minAge minAgeName maxAge maxAgeName
1     A     22        Sam     22        Sam
2     B     12      Sarah     58      James
3     C     17     Andrew     82      Sally
4     D     12     Mairin     67        Ray

However, if a group has more than one minimum or maximum value, I run into a problem:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

> data %>% group_by(Group) %>%
+   summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
+             maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value

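I am looking for two solutions:

(1) one where it does not matter which minimum or maximum name is shown, so only one is displayed (i.e. the first value found)
(2) one where, if there are ties, all of the minimum and maximum names are shown

Please let me know if anything is unclear, and thanks in advance!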

sha*_*dow 21

You can use which.min and which.max to get the first value.

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = Name[which.min(Age)], 
            maxAge = max(Age), maxAgeName = Name[which.max(Age)])
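To see why this avoids the error, here is a small base R sketch using the ages of group B from the question's second data set. The vector names `age_b` and `name_b` are made up for illustration. `which()` applied to a comparison returns every matching position, whereas `which.min()` returns only the first one.

```r
# Ages and names of group B from the question's second data set
# (age_b and name_b are illustrative names, not from the original post)
age_b  <- c(31, 31, 35, 58)
name_b <- c("Sarah", "Jim", "Fred", "James")

# which() returns every position that matches the minimum: two indices here
all_min <- which(age_b == min(age_b))

# which.min() returns only the first position of the minimum
first_min <- which.min(age_b)

name_b[all_min]    # "Sarah" "Jim"
name_b[first_min]  # "Sarah"
```

Because `name_b[all_min]` has length two, it cannot fill a single cell of a one-row-per-group summary, which is what the error in the question is complaining about.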

To get all values, use e.g. paste with an appropriate collapse argument.

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "), 
            maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
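As a quick illustration of what `collapse` does, the base R sketch below joins the two tied names from group B into one string. The variable names `tied_names` and `one_string` are illustrative only.

```r
# The two names tied for the minimum age in group B
tied_names <- c("Sarah", "Jim")

# collapse = ", " joins all elements into a single string
one_string <- paste(tied_names, collapse = ", ")

one_string          # "Sarah, Jim"
length(one_string)  # 1
```

Since the result always has length one, it fits into a single cell per group regardless of how many ties there are.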


A5C*_*2T1 12

I would actually recommend keeping your data in a "long" format. Here is how I would approach this:

library(dplyr)

Keeping all values when there are ties:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%  ## optional
  filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3    Jim  31     B
# 4  James  58     B
# 5 Andrew  17     C
# 6  Sally  82     C
# 7 Mairin  12     D
# 8    Ray  67     D
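The filter condition above can be checked on a single group in base R. This sketch reuses the ages of group B, with `age_b` and `keep` as illustrative names. `range()` returns the minimum and maximum as a length-two vector, and `%in%` flags every element equal to either of them, so tied rows are all kept.

```r
# Ages of group B from the question's second data set
age_b <- c(31, 31, 35, 58)

# range() returns c(min, max)
range(age_b)  # 31 58

# %in% marks every element that equals the minimum or the maximum
keep <- age_b %in% range(age_b)
keep  # TRUE TRUE FALSE TRUE
```

For a group with a single row, the minimum and maximum coincide and that one row is kept once, which matches the single "Sam" row in the output above.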

Keeping only one value when there are ties:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3  James  58     B
# 4 Andrew  17     C
# 5  Sally  82     C
# 6 Mairin  12     D
# 7    Ray  67     D
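As an alternative that is not part of the original answer, dplyr 1.0.0 and later provide `slice_min()` and `slice_max()`, whose `with_ties` argument covers both cases. The sketch below assumes such a dplyr version is installed and rebuilds the question's second data set; the result names (`youngest_one` and so on) are made up for illustration.

```r
library(dplyr)

# Second data set from the question (Sarah and Jim tie for the minimum in group B)
data <- data.frame(
  Name  = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
            "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age   = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
)

# Exactly one youngest row per group, even when ages tie
youngest_one <- data %>%
  group_by(Group) %>%
  slice_min(Age, n = 1, with_ties = FALSE)

# All youngest rows per group, so tied rows are kept
youngest_all <- data %>%
  group_by(Group) %>%
  slice_min(Age, n = 1, with_ties = TRUE)

# Exactly one oldest row per group
oldest_one <- data %>%
  group_by(Group) %>%
  slice_max(Age, n = 1, with_ties = FALSE)
```

With this data, `youngest_one` and `oldest_one` have one row per group, while `youngest_all` has one extra row because of the tie in group B.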

If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

library(dplyr)
library(tidyr)

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%
  mutate(minmax = c("min", "max")) %>%
  gather(var, val, Name:Age) %>%
  unite(key, minmax, var) %>%
  spread(key, val)
# Source: local data frame [4 x 5]
# 
#   Group max_Age max_Name min_Age min_Name
# 1     A      22      Sam      22      Sam
# 2     B      58    James      31    Sarah
# 3     C      82    Sally      17   Andrew
# 4     D      67      Ray      12   Mairin

It is not clear, though, what wide form you would want when there are ties.
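In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The sketch below is one possible way to get a comparable wide table with `pivot_wider()` alone; it is not the original answer's code, assumes recent dplyr and tidyr versions, and uses the made-up result name `wide`. As in the answer above, only one row per extreme is kept, so ties are not represented.

```r
library(dplyr)
library(tidyr)

# Second data set from the question
data <- data.frame(
  Name  = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
            "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age   = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
)

wide <- data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%                  # first and last row of each group by age
  mutate(minmax = c("min", "max")) %>%  # label the two rows
  ungroup() %>%
  # Spread Name and Age into one column per label,
  # giving Name_min, Name_max, Age_min, Age_max
  pivot_wider(names_from = minmax, values_from = c(Name, Age))
```

Passing both `Name` and `Age` to `values_from` keeps each column's type, so the ages stay numeric instead of being coerced to text as they are when names and ages are gathered into one column.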