I have the following data frame:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- nycflights13::flights %>%
  select(distance) %>%
  group_by(distance) %>%
  summarise(n = n()) %>%
  arrange(distance) %>%
  ungroup()
df
#> # A tibble: 214 x 2
#> distance n
#> <dbl> <int>
#> 1 17 1
#> 2 80 49
#> 3 94 976
#> 4 96 607
#> 5 116 443
#> 6 143 439
#> 7 160 376
#> 8 169 545
#> 9 173 221
#> 10 184 5504
#> # … with 204 more rows
What I want to do is bin the distance column into bins of size 100 and sum the n column accordingly. How can I do this?
So you would get something like:
bin_distance sum_n
1-100 1633 #(1 + 49 + 976 + 607)
101-200 21344 # (443 + ... + 5327)
#etc
The simplest way is to use cut: create the groups with seq, one for every 100 values, and sum the values within each group.
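library(dplyr)
# Round the last break up to a multiple of 100; with max(distance) alone as the
# upper limit, distances above the last break would fall into an NA group.
breaks <- seq(0, ceiling(max(df$distance) / 100) * 100, by = 100)
df %>%
  group_by(group = cut(distance, breaks = breaks)) %>%
  summarise(n = sum(n))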
# group n
# <fct> <int>
# 1 (0,100] 1633
# 2 (100,200] 21344
# 3 (200,300] 28310
# 4 (300,400] 7748
# 5 (400,500] 21292
# 6 (500,600] 26815
# 7 (600,700] 7846
# 8 (700,800] 48904
# 9 (800,900] 7574
#10 (900,1e+03] 18205
# ... with 17 more rows
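The question asks for labels such as 1-100 and 101-200 rather than the default interval notation. A minimal sketch of one way to get them, assuming integer-valued distances, is to pass a labels vector to cut. The toy data frame and the names breaks, labels, and binned are made up for illustration; toy holds only the first rows of the table in the question.

```r
library(dplyr)

# Hypothetical toy data: the first six rows of the question's table
toy <- tibble(
  distance = c(17, 80, 94, 96, 116, 143),
  n        = c(1L, 49L, 976L, 607L, 443L, 439L)
)

# Breaks every 100 units, with the upper limit rounded up to a multiple of 100
breaks <- seq(0, ceiling(max(toy$distance) / 100) * 100, by = 100)

# One label per interval, e.g. "1-100", "101-200" (assumes integer distances)
labels <- paste0(head(breaks, -1) + 1, "-", tail(breaks, -1))

binned <- toy %>%
  group_by(bin_distance = cut(distance, breaks = breaks, labels = labels)) %>%
  summarise(sum_n = sum(n))

binned
```

On this toy data the first bin sums to 1 + 49 + 976 + 607 = 1633, matching the figure given in the question.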
This can be translated to base R using aggregate, like this:
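# Same binning in base R; the upper break is rounded up to a multiple of 100
breaks <- seq(0, ceiling(max(df$distance) / 100) * 100, by = 100)
aggregate(n ~ distance,
          transform(df, distance = cut(distance, breaks = breaks)), sum)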
A different tidyverse solution. It closely follows the logic of @Ronak Shah's code, but uses cut_width() from ggplot2 instead of cut():
library(dplyr)
library(ggplot2)  # provides cut_width()
nycflights13::flights %>%
  select(distance) %>%
  group_by(ints = cut_width(distance, width = 100, boundary = 0)) %>%
  summarise(n = n())
ints n
<fct> <int>
1 [0,100] 1633
2 (100,200] 21344
3 (200,300] 28310
4 (300,400] 7748
5 (400,500] 21292
6 (500,600] 26815
7 (600,700] 7846
8 (700,800] 48904
9 (800,900] 7574
10 (900,1e+03] 18205
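The first interval is printed as (0,100] by cut() but as [0,100] by cut_width(). The difference only matters for a value exactly equal to the lowest break. A small sketch with made-up values (x, breaks, open_left, and closed_left are illustrative names) shows how base cut() treats that edge, with and without include.lowest = TRUE:

```r
# Made-up values: one sits exactly on the lowest break
x <- c(0, 100, 101)
breaks <- c(0, 100, 200)

# Default cut(): intervals are open on the left, so 0 is not in (0,100]
open_left <- as.character(cut(x, breaks))
open_left
# NA, "(0,100]", "(100,200]"

# include.lowest = TRUE closes the first interval on the left
closed_left <- as.character(cut(x, breaks, include.lowest = TRUE))
closed_left
# "[0,100]", "[0,100]", "(100,200]"
```

In the flights data the smallest distance is 17, so both approaches give the same counts there.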