J.S*_*ree 7 r subset counting dplyr summarize
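I have a dataset that I want to summarize. First, I want the total score for home and away games, which I can already do. However, I also want to know how many outliers (defined as a score above 300) there are in each subcategory (home, away).
If I weren't using summarize, I know dplyr has the count() function, but I'd like the solution to live inside my summarize() call. Here is what I have, and what I tried that failed to run: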
# Test data
library(dplyr)
test <- tibble(score = c(100, 150, 200, 301, 150, 345, 102, 131),
               location = c("home", "away", "home", "away", "home", "away", "home", "away"),
               more_than_300 = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE))
# Attempt 1: count rows that match a criterion
test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = nrow(.[more_than_300 == FALSE]))
You can use sum on a logical vector: it is automatically coerced to numbers (TRUE counts as 1 and FALSE as 0), so you only need to do the following:
test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(more_than_300))
#> # A tibble: 2 x 3
#>   location total_score n_outliers
#>   <chr>          <dbl>      <int>
#> 1 away             927          2
#> 2 home             552          0
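To see the coercion on its own, here is a minimal base R sketch. The vector name flags is made up for illustration and is not part of the question's data.

```r
# Hypothetical logical vector, for illustration only
flags <- c(FALSE, TRUE, FALSE, TRUE)

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values
n_true <- sum(flags)     # 2

# mean() uses the same coercion, so it gives the share of TRUE values
share_true <- mean(flags)  # 0.5
```

The same idea is what lets sum(more_than_300) act as a per-group counter inside summarize().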
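Or, if these are the only 3 columns you have, the following is equivalent (except that the result columns keep their original names, score and more_than_300):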
test %>%
  group_by(location) %>%
  summarize(across(everything(), sum))
In fact, you don't need to create the more_than_300 column at all; it is enough to do the following:
test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(score > 300))
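One caveat worth sketching: the question's data has no missing values, but if score could contain NA, then sum(score > 300) would return NA for that group. A possible variant, assuming you want to ignore missing scores, passes na.rm = TRUE. The tibble test_na below is hypothetical and only mimics the shape of test.

```r
library(dplyr)

# Hypothetical data: same columns as `test`, plus one missing score
test_na <- tibble(score = c(100, 150, NA, 301, 345),
                  location = c("home", "away", "home", "away", "away"))

result <- test_na %>%
  group_by(location) %>%
  summarize(total_score = sum(score, na.rm = TRUE),        # skip NA when totalling
            n_outliers = sum(score > 300, na.rm = TRUE))   # skip NA when counting
```

Whether dropping missing scores is the right choice depends on the data; leaving out na.rm makes the NA visible in the summary instead.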