jen*_*irf 126 group-by r frequency dplyr
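Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of the number of gears by am (automatic/manual) in one go with dplyr?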
library(dplyr)
data(mtcars)
mtcars <- as_tibble(mtcars)  # tbl_df() is deprecated in recent dplyr versions
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Hen*_*rik 244
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
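From the dplyr vignette:
"When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset."
Thus, after the summarise, the grouping variable 'gear' is peeled off and the data is grouped "only" by 'am' (you can verify this by calling groups on the resulting data). The mutate calculation is then performed within each level of 'am'.
The outcome of the "peeling" of course depends on the order of the grouping variables in the group_by call. This time we were lucky that it peeled off the desired variable. You may wish to add a subsequent group_by(am) to make your code more explicit.
For rounding and prettification, see the nice answer by @Tyler Rinker.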
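Here is a minimal sketch of the more explicit variant mentioned above. It assumes dplyr 1.0.0 or later, because it uses the `.groups` argument, which does not exist in older versions. The grouping is dropped after summarise and then set again by hand, so the result no longer depends on the order of the variables in group_by. The name `rel_freq` is only illustrative.

```r
library(dplyr)

rel_freq <- mtcars %>%
  group_by(gear, am) %>%                    # order deliberately swapped
  summarise(n = n(), .groups = "drop") %>%  # drop all grouping (dplyr >= 1.0.0)
  group_by(am) %>%                          # say explicitly what the shares are relative to
  mutate(freq = n / sum(n)) %>%
  ungroup()

rel_freq
```

The frequencies still sum to 1 within each level of am, even though gear was listed first in group_by.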
Mat*_*fou 34
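You can use the count() function, but it behaves differently depending on the version of dplyr:
dplyr 0.7.1: returns an ungrouped table, so you need to group again by am
dplyr < 0.7.1: returns a grouped table, so there is no need to group again, although you may want to ungroup() for later manipulations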
dplyr 0.7.1
mtcars %>%
count(am, gear) %>%
group_by(am) %>%
mutate(freq = n / sum(n))
dplyr <0.7.1
mtcars %>%
count(am, gear) %>%
mutate(freq = n / sum(n))
This results in a grouped table. If you want to use it for further analysis, it may be useful to remove the grouping attribute with ungroup().
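If you are not sure which behaviour your installed version has, the following sketch checks it and then computes the frequencies in a way that should not depend on it. It assumes that is_grouped_df() and ungroup() are available in your dplyr version; the names `counts` and `freq_tbl` are only illustrative.

```r
library(dplyr)

counts <- mtcars %>% count(am, gear)

# TRUE if count() returned a grouped table, FALSE if it is ungrouped
is_grouped_df(counts)

# Reset whatever grouping there is, then group by am explicitly
freq_tbl <- counts %>%
  ungroup() %>%
  group_by(am) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

freq_tbl
```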
Tyl*_*ker 26
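@Henrik's answer is better for usability, because this one turns the column into character rather than numeric, but it matches what you asked for...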
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
EDIT: because Spacedman asked for it :-)
as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
class(x) <- c("rel_freq", class(x))
attributes(x)[["rel_freq_col"]] <- rel_freq_col
x
}
print.rel_freq <- function(x, ...) {
freq_col <- attributes(x)[["rel_freq_col"]]
x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")
class(x) <- class(x)[!class(x)%in% "rel_freq"]
print(x)
}
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
as.rel_freq()
## Source: local data frame [4 x 4]
## Groups: am
##
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
jos*_*rrà 11
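For the sake of completeness on this popular question: since version 1.0.0 of dplyr, the .groups argument controls the grouping structure of the result of summarise after group_by (see the summarise help).
With .groups = "drop_last", summarise drops the last level of grouping. This was the only result obtainable before version 1.0.0.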
library(dplyr)
library(scales)
original <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
original
#> # A tibble: 4 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 1 4 8 61.5%
#> 4 1 5 5 38.5%
new_drop_last <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop_last") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(original, new_drop_last)
#> [1] TRUE
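With .groups = "drop", all levels of grouping are dropped. The result is an independent tibble with no trace of the previous group_by, so the frequencies are computed relative to the whole table.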
# .groups = "drop"
new_drop <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_drop
#> # A tibble: 4 x 4
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 46.9%
#> 2 0 4 4 12.5%
#> 3 1 4 8 25.0%
#> 4 1 5 5 15.6%
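With .groups = "keep", the result has the same grouping structure as .data (here, mtcars grouped by am and gear). summarise does not peel off any of the variables used in group_by, so each group contains a single row and every frequency is 100%.
Finally, with .groups = "rowwise", each row is its own group. In this situation it is equivalent to "keep".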
# .groups = "keep"
new_keep <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "keep") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_keep
#> # A tibble: 4 x 4
#> # Groups: am, gear [4]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 100.0%
#> 2 0 4 4 100.0%
#> 3 1 4 8 100.0%
#> 4 1 5 5 100.0%
# .groups = "rowwise"
new_rowwise <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "rowwise") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE
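To see directly which grouping variables survive each option, here is a small sketch using group_vars(). It assumes dplyr 1.0.0 or later; `grouping_after` is a hypothetical helper name, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: the grouping variables left after summarise()
grouping_after <- function(groups_option) {
  mtcars %>%
    group_by(am, gear) %>%
    summarise(n = n(), .groups = groups_option) %>%
    group_vars()
}

grouping_after("drop_last")  # "am"
grouping_after("drop")       # character(0)
grouping_after("keep")       # "am" "gear"
```

Whatever is returned here is the set of variables within which a following mutate(freq = n / sum(n)) computes its sums.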
Another point that may be of interest: sometimes, after applying group_by and summarise, a subtotal row helps.
# create a subtotal line to help readability
subtotal_am <- mtcars %>%
group_by (am) %>%
summarise (n=n()) %>%
mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)
mtcars %>% group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
bind_rows(subtotal_am) %>%
arrange(am, gear) %>%
mutate(rel.freq = scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 0 NA 19 100.0%
#> 4 1 4 8 61.5%
#> 5 1 5 5 38.5%
#> 6 1 NA 13 100.0%
Created on 2020-11-09 by the reprex package (v0.3.0)
Hope you find this answer useful.
I wrote a small function for this recurring task:
count_pct <- function(df) {
return(
df %>%
tally %>%
mutate(n_pct = 100*n/sum(n))
)
}
Then I can use it like this:
mtcars %>%
group_by(cyl) %>%
count_pct
It returns:
# A tibble: 3 x 3
cyl n n_pct
<dbl> <int> <dbl>
1 4 11 34.4
2 6 7 21.9
3 8 14 43.8
Despite the many answers, here is one more approach, which uses prop.table in combination with dplyr or data.table.
library("dplyr")
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = prop.table(n))
library("data.table")
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n) , by = "am"]
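For a vector, prop.table(n) is simply n / sum(n), which is why it can replace that expression inside a grouped mutate. For comparison, here is a sketch of the same proportions in base R only, using a contingency table and margin = 1 to normalise within each row (each level of am). The names `counts` and `props` are only illustrative.

```r
# Cross-tabulate am (rows) against gear (columns)
counts <- table(am = mtcars$am, gear = mtcars$gear)

# margin = 1 divides every cell by its row total, so each row sums to 1
props <- prop.table(counts, margin = 1)
props
```

Unlike the dplyr and data.table versions, this returns a wide table that also contains the empty combinations (am = 0 with gear = 5, am = 1 with gear = 3) as zeros.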
Here is a general function that implements Henrik's solution on dplyr 0.7.1.
freq_table <- function(x,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
x %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n /sum(n)) %>%
ungroup
}
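On newer installations the same helper can be written with the embrace operator instead of enquo and !!. This is only a sketch: it assumes rlang 0.4.0 or later (for `{{ }}`) and dplyr 1.0.0 or later (for `.groups`), and `freq_table2` is a hypothetical name chosen so that it does not overwrite the function above.

```r
library(dplyr)

# Hypothetical variant of freq_table() using {{ }} instead of enquo()/!!
freq_table2 <- function(x, group_var, prop_var) {
  x %>%
    group_by({{ group_var }}, {{ prop_var }}) %>%
    summarise(n = n(), .groups = "drop_last") %>%  # keep grouping by group_var
    mutate(freq = n / sum(n)) %>%
    ungroup()
}

# Usage: relative frequency of gear within each level of am
res <- freq_table2(mtcars, am, gear)
res
```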
小智 5
Also, try add_count() (to get around the pesky .groups behaviour of group_by and summarise).
mtcars %>%
count(am, gear) %>%
add_count(am, wt = n, name = "nn") %>%
mutate(proportion = n / nn)