Percentage of total count by group in R

Nat*_*ate 8 r dplyr summarize

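I'm trying to create an output that calculates, for each factor level, its count as a percentage of the total count in the data frame, but I can't figure out how to keep the grouping structure in the output.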

I can get the total count that I want to divide by...

df %>% summarise(sum(num))
# 15

...and the totals by group...

df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
#   species                  `sum(num)`
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum          4
# 2 Farfantepenaeus notialis          0
# 3 Farfantepenaeus spp              11

But I can't get it to look like this...

# ???
#   species                     Percent
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum       4 / 15 = 0.267
# 2 Farfantepenaeus notialis       0 / 15 = 0.000
# 3 Farfantepenaeus spp           11 / 15 = 0.733

The closest I've gotten is this, but because I used reframe(), it returns ungrouped data:

df %>% group_by(species) %>%
  summarise(factor_count=sum(num)) %>%
  # ungroup() %>%
  # Warning: # Please use `reframe()` instead., When switching from `summarise()`
  # to `reframe()`, remember that `reframe()` always returns an ungrouped data
  reframe(percent=factor_count/sum(df$num))

# A tibble: 3 × 1
  percent
    <dbl>
1   0.267
2   0
3   0.733

Data:

> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp",
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L,
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L,
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L,
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L,
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L,
258688L, 97737L), class = "data.frame")

r2e*_*ans 10

Two steps: summarize the group totals, then compute the percentage across all of them.

library(dplyr)
df %>%
  summarize(Percent = sum(num), .by = species) %>%
  mutate(Percent = Percent / sum(Percent))
#                    species   Percent
# 1 Farfantepenaeus notialis 0.0000000
# 2      Farfantepenaeus spp 0.7333333
# 3 Farfantepenaeus duorarum 0.2666667

Regarding your code:

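  • reframe is not necessary here. It is mostly meant for cases where the number of rows changes; it can usually be used in place of summarise, though I haven't verified if or where the two differ significantly. In fact, here it drops the species column.
  • (Almost) never use df$ in a pipeline that starts with df. Using df$num ignores everything done since the start of the pipeline, so any grouping, filtering, or added/changed columns are not reflected in that version of df. There are certainly times when it is useful or even necessary, but they are rare.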
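
To make the second point concrete, here is a minimal sketch using a small made-up data frame (`toy` is hypothetical, not the question's `df`). It assumes dplyr 1.1.0 or later for the `.by` argument. After a filter step, the denominator taken from `toy$num` still includes the rows that were filtered out, so the shares no longer sum to 1.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  species = c("a", "a", "b", "c"),
  num     = c(1L, 3L, 4L, 2L)
)

# Denominator comes from the data flowing through the pipeline:
# only the filtered rows (1 + 3 + 4 = 8) are counted
good <- toy %>%
  filter(species != "c") %>%
  summarise(total = sum(num), .by = species) %>%
  mutate(share = total / sum(total))

# Denominator comes from the original object: toy$num still
# contains the filtered-out row, so the total is 10, not 8
bad <- toy %>%
  filter(species != "c") %>%
  summarise(total = sum(num), .by = species) %>%
  mutate(share = total / sum(toy$num))

sum(good$share)  # 1
sum(bad$share)   # 0.8
```

In the question's code no filtering happens, so `sum(df$num)` happens to give the right denominator; the pattern only becomes a problem once the pipeline changes the rows.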


jay*_*.sf 7

Using xtabs:

> xtabs(num ~ species, df) |> proportions() |> as.data.frame()
                   species         Freq
1 Farfantepenaeus duorarum 0.2666666667
2 Farfantepenaeus notialis 0.0000000000
3      Farfantepenaeus spp 0.7333333333


Ony*_*mbu 6

Pass the values to the wt argument of count():

df %>%
    count(species, wt = num/sum(.$num), name = 'percent')

                   species   percent
1 Farfantepenaeus duorarum 0.2666667
2 Farfantepenaeus notialis 0.0000000
3      Farfantepenaeus spp 0.7333333
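
As a rough illustration of why this works: with `wt` supplied, `count()` sums the weight expression within each group, so weighting every row by its share of the overall total gives the group share directly. The sketch below uses a made-up `toy` data frame (hypothetical) and assumes dplyr 1.1.0 or later for `.by`. It writes the weight as `num / sum(num)` rather than `.$num`, which is a variation on the answer above, not the answer's exact code.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  species = c("a", "a", "b"),
  num     = c(1L, 3L, 4L)
)

# Each row's weight is its share of the overall total (8);
# count() then sums those weights per species
by_count <- toy %>%
  count(species, wt = num / sum(num), name = "percent")

# The two-step version for comparison: group totals first,
# then divide by the sum of the totals
by_summarise <- toy %>%
  summarise(percent = sum(num), .by = species) %>%
  mutate(percent = percent / sum(percent))

all.equal(by_count$percent, by_summarise$percent)
```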


Tar*_*Jae 5

Here are two alternative approaches:

map_vec

library(purrr)
library(dplyr)

df %>% 
  summarise(sum_num = sum(num), .by=species) %>% 
  mutate(percent = map_vec(sum_num, ~ .x /  sum(df$num)))

Base R:

# credits to @r2evans: 
aggregate(num ~ species, data = df, sum) |>
  transform(percent = num/sum(num))

# or:
df_sums <- aggregate(num ~ species, data = df, sum)
df_sums$percent <- df_sums$num / sum(df$num)

df_sums
                   species sum_num   percent
1 Farfantepenaeus notialis       0 0.0000000
2      Farfantepenaeus spp      11 0.7333333
3 Farfantepenaeus duorarum       4 0.2666667

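  • It is one of those base R functions that has been around for decades (literally, [41c2f73](https://github.com/wch/r-source/commit/41c2f7338c45dbf9eac99c210206bc3657bca98a)), but I often overlook it. One of the biggest differences from `mutate` (aside from the recent `.by=` and other dot-arguments) is that `mutate` lets you reference columns added or changed in the same call, whereas `transform` does not. (3 upvotes)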
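
The difference described in this comment can be sketched as follows, using a made-up `toy` data frame (hypothetical). The example assumes dplyr is installed, R 4.1 or later for the native pipe, and that no object named `total` exists in the calling environment, since `transform()` would otherwise pick that up instead of failing.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(num = c(1, 3))

# mutate() evaluates its arguments in order, so `share` can use
# the `total` column created earlier in the same call
via_mutate <- toy %>%
  mutate(total = sum(num), share = num / total)

# transform() evaluates all arguments against the original columns,
# so `total` is not visible yet and the call errors
via_transform <- try(
  transform(toy, total = sum(num), share = num / total),
  silent = TRUE
)
failed <- inherits(via_transform, "try-error")

# With transform(), add one column per call instead
two_step <- toy |>
  transform(total = sum(num)) |>
  transform(share = num / total)
```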