Finding the proportion of grouped observations with dplyr in R

Jos*_*erg 2 r dplyr

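I often use the group_by() and summarize() functions from the dplyr package in R (note: when the summary statistic is sum(), this is the same as count() with a wt argument).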
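A minimal sketch of that equivalence, using made-up toy data (the names `toy`, `g`, and `x` are hypothetical and not from the question). Without `wt`, count() counts rows; with `wt = x`, it sums `x` within each group, which should match group_by() followed by summarize() with sum().

```r
library(dplyr)

# Toy data, invented for this illustration only
toy <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5)
)

# Grouped sum with group_by() + summarize()
by_summarize <- toy %>%
  group_by(g) %>%
  summarize(n = sum(x))

# The same sums with count(), using x as the weight
by_count <- toy %>%
  count(g, wt = x)

by_summarize$n  # 3 12
by_count$n      # 3 12
```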

Here is an example:

library(dplyr)

data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

Here is the output:

out_df <- 
    data %>% 
        # group by both columns so that factor is kept in the result
        group_by(group, factor) %>% 
        summarize(sum_var1 = sum(var1))

print(out_df)

Source: local data frame [7 x 3]
Groups: group [4]

    group   factor sum_var1
   <fctr>   <fctr>    <int>
1 Group A Factor 1       29
2 Group B Factor 1        8
3 Group C Factor 1       33
4 Group D Factor 1       12
5 Group A Factor 2       27
6 Group B Factor 2       10
7 Group C Factor 2       17

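Now, I often want to find each sum_var1 value as a proportion, not of the overall total, but of the total within one level of a factor, such as the factor variable here.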

I usually do this by finding the sum for each level of the factor and then manually dividing each observation by it, like this:

# Total of sum_var1 for each level of factor
out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
# Hard-code those totals by hand, repeated once per row
to_divide <- c(rep(82, 4), rep(54, 4))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide

This produces the desired output, and I can check that the sum of factor_prop_sum_var1 equals 1 within each level of factor:

out_df

Source: local data frame [8 x 4]
Groups: group [4]

    group   factor sum_var1 factor_prop_sum_var1
   <fctr>   <fctr>    <int>                <dbl>
1 Group A Factor 1       26            0.3170732
2 Group B Factor 1       17            0.2073171
3 Group C Factor 1       19            0.2317073
4 Group D Factor 1       18            0.2195122
5 Group A Factor 2        8            0.1481481
6 Group B Factor 2       19            0.3518519
7 Group C Factor 2        7            0.1296296
8 Group D Factor 2       22            0.4074074

out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))

# A tibble: 2 × 2
    factor checking
    <fctr>    <dbl>
1 Factor 1        1
2 Factor 2        1

This works, but it is clunky at best. Is there a more elegant way to do this, preferably within a dplyr pipeline?

eip*_*i10 5

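To get proportions within a group, just group by the column over which you want the proportions to add up to 100%. So in this case, after getting the sum for each combination of group and factor, use group_by again, but this time group only by factor, and then calculate the percentages.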

library(dplyr)

set.seed(100)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

data %>% 
  group_by(group, factor) %>% 
  summarize(sum_var1 = sum(var1)) %>%
  group_by(factor) %>%
  mutate(percent = sum_var1/sum(sum_var1)) %>%
  arrange(factor)

Output:

    group   factor sum_var1    percent
1 Group A Factor 1       13 0.25000000
2 Group B Factor 1        8 0.15384615
3 Group C Factor 1       21 0.40384615
4 Group D Factor 1       10 0.19230769
5 Group A Factor 2       20 0.23809524
6 Group B Factor 2       27 0.32142857
7 Group C Factor 2        2 0.02380952
8 Group D Factor 2       35 0.41666667
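Because the example above relies on random sampling, the printed numbers can differ between R versions. As a sketch of the same regrouping idea on fixed data, the block below uses a small hypothetical data frame (`toy`, with values chosen so the proportions are easy to verify by hand). The `.groups = "drop"` argument is an addition that assumes dplyr 1.0.0 or later; it only removes the leftover grouping before regrouping by factor.

```r
library(dplyr)

# Hypothetical toy data with known values (not from the question)
toy <- data.frame(
  group  = c("A", "B", "A", "B", "A", "B"),
  factor = c("F1", "F1", "F2", "F2", "F2", "F2"),
  var1   = c(1, 3, 2, 6, 4, 4)
)

props <- toy %>%
  group_by(group, factor) %>%
  # sum within each group/factor combination; .groups assumes dplyr >= 1.0.0
  summarize(sum_var1 = sum(var1), .groups = "drop") %>%
  # regroup by the column whose levels should each add up to 1
  group_by(factor) %>%
  mutate(percent = sum_var1 / sum(sum_var1)) %>%
  ungroup() %>%
  arrange(factor, group)

props$percent  # 0.250 0.750 0.375 0.625

# Check: the proportions add up to 1 within each level of factor
check <- props %>%
  group_by(factor) %>%
  summarize(checking = sum(percent))
```

Here F1 has a total of 4 (1 and 3) and F2 a total of 16 (6 and 10), so the proportions can be confirmed by hand, and `check$checking` is 1 for both levels.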