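I often use the functions group_by() and summarize() from the dplyr package in R (note: this is the same as the count() function if the summary statistic is sum()).
Here is an example: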
library(dplyr)
data <- data.frame(
group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
var1 = sample(1:16)
)
Here is the output:
out_df <-
data %>%
group_by(group, factor) %>%
summarize(sum_var1 = sum(var1))
print(out_df)
Source: local data frame [7 x 3]
Groups: group [4]
group factor sum_var1
<fctr> <fctr> <int>
1 Group A Factor 1 29
2 Group B Factor 1 8
3 Group C Factor 1 33
4 Group D Factor 1 12
5 Group A Factor 2 27
6 Group B Factor 2 10
7 Group C Factor 2 17
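Now, I often want to find the proportion of each sum_var1 value, not as a proportion of the grand total, but as a proportion of the total within one level of a factor, such as the factor variable here.
I usually do this by finding the sum for each level of the factor and then manually dividing each observation by it, like this: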
out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
to_divide <- (c(rep(82, 4), rep(54, 4)))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide
This produces the desired output, and I can check that the sum of factor_prop_sum_var1 equals 1 within each level of factor:
out_df
Source: local data frame [8 x 4]
Groups: group [4]
group factor sum_var1 factor_prop_sum_var1
<fctr> <fctr> <int> <dbl>
1 Group A Factor 1 26 0.3170732
2 Group B Factor 1 17 0.2073171
3 Group C Factor 1 19 0.2317073
4 Group D Factor 1 18 0.2195122
5 Group A Factor 2 8 0.1481481
6 Group B Factor 2 19 0.3518519
7 Group C Factor 2 7 0.1296296
8 Group D Factor 2 22 0.4074074
out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))
# A tibble: 2 × 2
factor checking
<fctr> <dbl>
1 Factor 1 1
2 Factor 2 1
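This works, but it is clunky at best. Is there a more elegant way to do this (ideally within a dplyr "pipeline")?
To get proportions within a group, just group by the column within which you want the proportions to add up to 100%. So, in this case, after getting the sums for each combination of group and factor, use group_by again, but this time group only by factor, and then calculate the percentages.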
library(dplyr)
set.seed(100)
data <- data.frame(
group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
var1 = sample(1:16)
)
data %>%
group_by(group, factor) %>%
summarize(sum_var1 = sum(var1)) %>%
group_by(factor) %>%
mutate(percent = sum_var1/sum(sum_var1)) %>%
arrange(factor)
    group   factor sum_var1    percent
1 Group A Factor 1       13 0.25000000
2 Group B Factor 1        8 0.15384615
3 Group C Factor 1       21 0.40384615
4 Group D Factor 1       10 0.19230769
5 Group A Factor 2       20 0.23809524
6 Group B Factor 2       27 0.32142857
7 Group C Factor 2        2 0.02380952
8 Group D Factor 2       35 0.41666667
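As a quick sanity check, the same pipeline can be extended to confirm that the proportions add up to 1 within each level of factor, mirroring the check in the question. This is only a sketch: the names props and check are made up for illustration, the .groups argument assumes dplyr 1.0.0 or later, and the exact sampled values can differ between R versions even with the same seed.

```r
library(dplyr)

set.seed(100)
data <- data.frame(
  group  = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = FALSE),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = FALSE),
  var1   = sample(1:16)
)

# props is a hypothetical name for the result of the pipeline above
props <- data %>%
  group_by(group, factor) %>%
  # .groups = "drop" (assumes dplyr >= 1.0.0) discards the leftover grouping
  summarize(sum_var1 = sum(var1), .groups = "drop") %>%
  group_by(factor) %>%
  mutate(percent = sum_var1 / sum(sum_var1)) %>%
  ungroup()

# check is a hypothetical name: one row per level of factor
check <- props %>%
  group_by(factor) %>%
  summarize(checking = sum(percent))

# Each level of factor should sum to 1, up to floating-point error
stopifnot(all(abs(check$checking - 1) < 1e-8))
```

Ending with ungroup() is a design choice rather than a requirement: it keeps later steps from silently operating per factor.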