dplyr:组中的最大值,不包括每行中的值?

Eri*_*ric 8 r greatest-n-per-group dplyr

我有一个数据框,如下所示:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]

  g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4
Run Code Online (Sandbox Code Playgroud)

我知道如何x为每个组添加一个具有最大值的列g:

> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max
1 A 7     7
2 A 3     7
3 B 5     9
4 B 9     9
5 B 2     9
6 C 4     4
Run Code Online (Sandbox Code Playgroud)

但我想得到的是x每组的最大值g,不包括x每一行的值.

对于给定的示例,所需的输出将如下所示:

Source: local data frame [6 x 3]
Groups: g

  g x x_max x_max_exclude
1 A 7     7             3
2 A 3     7             7
3 B 5     9             9
4 B 9     9             5
5 B 2     9             9
6 C 4     4            NA
Run Code Online (Sandbox Code Playgroud)

我以为我可以使用row_number()删除特定元素并占用剩余的最大值,但是点击警告消息并-Inf输出错误:

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g

  g x x_max r x_max_exclude
1 A 7     7 1          -Inf
2 A 3     7 2          -Inf
3 B 5     9 1          -Inf
4 B 9     9 2          -Inf
5 B 2     9 3          -Inf
6 C 4     4 1          -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
Run Code Online (Sandbox Code Playgroud)

在dplyr中获取此输出的最{可读,简洁,高效}方法是什么?任何有关我的尝试使用row_number()不起作用的见解也将非常感激.谢谢您的帮助.

Ste*_*pré 6

你可以尝试:

df %>% 
  group_by(g) %>% 
  arrange(desc(x)) %>% 
  mutate(max = ifelse(x == max(x), x[2], max(x)))
Run Code Online (Sandbox Code Playgroud)

这使:

#Source: local data frame [6 x 3]
#Groups: g
#
#  g x max
#1 A 7   3
#2 A 3   7
#3 B 9   5
#4 B 5   9
#5 B 2   9
#6 C 4  NA
Run Code Online (Sandbox Code Playgroud)

基准

到目前为止,我已经在基准测试中尝试了解决方案:

df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

library(microbenchmark)

mbm <- microbenchmark(
  steven = df %>% 
    group_by(g) %>% 
    arrange(desc(x)) %>% 
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
           x_max2 = sort(x, decreasing = TRUE)[2], 
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
  times = 50
)
Run Code Online (Sandbox Code Playgroud)

@ Arun的data.table解决方案是最快的:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50  b 
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50 a   
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述