How to sum the top 3 highest values in a dataset when there are ties

And*_*löf 4 r top-n dplyr

I have a data frame (my_data) and want to sum only the 3 highest values, even when there may be ties. I am new to R, and I have been using dplyr.

# A tibble: 15 x 3
   city      month number
   <chr>     <chr>  <dbl>
 1 Lund      jan       12
 2 Lund      feb       12
 3 Lund      mar       18
 4 Lund      apr       28
 5 Lund      may       28
 6 Stockholm jan       15
 7 Stockholm feb       15
 8 Stockholm mar       30
 9 Stockholm apr       30
10 Stockholm may       10
11 Uppsala   jan       22
12 Uppsala   feb       30
13 Uppsala   mar       40
14 Uppsala   apr       60
15 Uppsala   may       30

This is the code I tried:

# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))

The expected (desired) output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     75
3 Uppsala      130

But the actual R output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     90
3 Uppsala      160

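It seems that if there are ties, all tied values are included in the sum. I only want to sum the 3 unique instances with the highest values.

Any help would be much appreciated! :)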

akr*_*run 5

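We can use distinct to remove the duplicated values first. The way top_n works, if values are tied at the cutoff, it keeps all of the tied rows, so a group can return more than n rows.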

my_data %>% 
   distinct(city, number, .keep_all = TRUE) %>%
   group_by(city) %>%
   top_n(3, number) %>%
   summarise(top_nr = sum(number))
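To make the tie behaviour concrete, here is a minimal sketch that counts how many rows top_n(3, number) keeps per city. It rebuilds the question's values in a small data frame (the month column is left out for brevity, and the name kept is only illustrative).

```r
library(dplyr)

# Same city/number values as the question's my_data (month omitted).
my_data <- data.frame(
  city = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  number = c(12, 12, 18, 28, 28, 15, 15, 30, 30, 10, 22, 30, 40, 60, 30)
)

# Count the rows that survive top_n(3, number) in each group.
kept <- my_data %>%
  group_by(city) %>%
  top_n(3, number) %>%
  summarise(rows_kept = n())

kept
```

Stockholm and Uppsala each keep 4 rows, because two rows tie for third place there, while Lund keeps exactly 3.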

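Update

Based on the OP's new output: the output of top_n is not arranged, so after top_n we arrange 'number' in descending order within each city and take the sum of the first 3 values of 'number'.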

my_data %>% 
   group_by(city) %>% 
   top_n(3, number) %>% 
   arrange(city,  desc(number)) %>% 
   summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
#  city      number
#  <chr>      <int>
#1 Lund          74
#2 Stockholm     75
#3 Uppsala      130

Data

my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund", 
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm", 
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan", 
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may", 
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L, 
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))


utu*_*bun 5

Life might be simpler without top_n():

my_data %>%
  group_by(city) %>%
  summarize(
    top_nr = sum(tail(sort(number), 3))
    )
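Another option that avoids top_n(), sketched here on the assumption that dplyr 1.0.0 or later is installed: slice_max() has a with_ties argument, and setting it to FALSE returns exactly n rows per group even when values tie at the cutoff. The data frame below repeats the question's values (month omitted), and top3 is just an illustrative name.

```r
library(dplyr)

# Same city/number values as the question's my_data (month omitted).
my_data <- data.frame(
  city = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  number = c(12, 12, 18, 28, 28, 15, 15, 30, 30, 10, 22, 30, 40, 60, 30)
)

# Keep exactly the 3 largest rows per city, breaking ties arbitrarily,
# then sum them.
top3 <- my_data %>%
  group_by(city) %>%
  slice_max(number, n = 3, with_ties = FALSE) %>%
  summarise(top_nr = sum(number))

top3
```

With this data the sums are 74 for Lund, 75 for Stockholm and 130 for Uppsala, the same as the sort/tail approach.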