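I'm trying to create output that computes, for each factor level, its count as a percentage of the total count in the data frame, but I can't figure out how to keep the grouping structure in the output.

I can get the total count I want to divide by...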
df %>% summarise(sum(num))
# 15

...and the totals by group...
df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
#   species                  `sum(num)`
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum          4
# 2 Farfantepenaeus notialis          0
# 3 Farfantepenaeus spp              11

But I can't get it to look like this...
# ???
#   species                  Percent
#   <chr>                    <int>
# 1 Farfantepenaeus duorarum  4 / 15 = 0.267
# 2 Farfantepenaeus notialis  0 / 15 = 0.000
# 3 Farfantepenaeus spp      11 / 15 = 0.733

The closest I've gotten is this, but because I used reframe() it returns ungrouped data:
df %>% group_by(species) %>%
  summarise(factor_count=sum(num)) %>%
  # ungroup() %>%
  # Warning: Please use `reframe()` instead. When switching from `summarise()`
  # to `reframe()`, remember that `reframe()` always returns an ungrouped data
  reframe(percent=factor_count/sum(df$num))

# A tibble: 3 × 1
  percent
    <dbl>
1   0.267
2   0
3   0.733

Data:
> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp",
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L,
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L,
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L,
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L,
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L,
258688L, 97737L), class = "data.frame")
r2e*_*ans 10
Two steps: summarize the group totals, then compute each total's share of the overall sum.
library(dplyr)
df %>%
summarize(Percent = sum(num), .by = species) %>%
mutate(Percent = Percent / sum(Percent))
# species Percent
# 1 Farfantepenaeus notialis 0.0000000
# 2 Farfantepenaeus spp 0.7333333
# 3 Farfantepenaeus duorarum 0.2666667
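For readers on a dplyr version without the .by argument (it was added in dplyr 1.1.0), the same two steps can be sketched with group_by(). The toy data frame and the column names total and percent are made up for illustration. The sketch relies on summarise() dropping the only grouping level, so the mutate() that follows sees all rows at once.

```r
library(dplyr)

# Hypothetical toy data, not the asker's df
toy <- data.frame(
  species = c("a", "b", "a", "c", "b"),
  num     = c(1L, 0L, 3L, 0L, 2L)
)

# Per-operation grouping with .by, as in the answer above
res_by <- toy %>%
  summarize(total = sum(num), .by = species) %>%
  mutate(percent = total / sum(total))

# Equivalent with group_by(): summarise() peels off the single grouping
# level, so the result is ungrouped and sum(total) spans every species
res_gb <- toy %>%
  group_by(species) %>%
  summarise(total = sum(num)) %>%
  mutate(percent = total / sum(total))

res_gb
```

Both versions keep the species column next to its share, which is the structure the question asks for.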
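On your code:
- reframe is unnecessary (it can usually be used in place of summarise, mostly when the number of rows changes, though I have not verified whether or where the two differ significantly), and in fact here it drops the species column.
- Using df$ in a pipe that starts with df: using df$num ignores anything done since the start of the pipe, meaning that grouping, filtering, adding or changing columns, etc. are not reflected in that version of df. Granted, there are times when it is useful or even necessary, but they are rare.

Using xtabs: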
> xtabs(num ~ species, df) |> proportions() |> as.data.frame()
species Freq
1 Farfantepenaeus duorarum 0.2666666667
2 Farfantepenaeus notialis 0.0000000000
3 Farfantepenaeus spp 0.7333333333
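To make the one-liner easier to follow, its three steps can be pulled apart. This is a sketch on a made-up toy data frame (toy, totals, shares and result are illustrative names), assuming R 4.0.1 or later, where proportions() is available as an alias of prop.table().

```r
# Hypothetical toy data, not the asker's df
toy <- data.frame(
  species = c("a", "b", "a", "c", "b"),
  num     = c(1L, 0L, 3L, 0L, 2L)
)

# Step 1: sum num within each species (a one-dimensional contingency table)
totals <- xtabs(num ~ species, data = toy)

# Step 2: divide every cell by the grand total
shares <- proportions(totals)

# Step 3: convert the table to a data frame with columns species and Freq
result <- as.data.frame(shares)
result
```

Because the result comes from a table, species is returned as a factor and the share column is named Freq unless it is renamed afterwards.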
Pass the weights to the wt argument of count():
df %>%
count(species, wt = num/sum(.$num), name = 'percent')
species percent
1 Farfantepenaeus duorarum 0.2666667
2 Farfantepenaeus notialis 0.0000000
3 Farfantepenaeus spp 0.7333333
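Note that the dot in .$num is the data frame as it arrives through the %>% pipe, whereas df$num always refers to the original object. The difference only shows once an earlier step changes the data. A minimal sketch, using a hypothetical toy data frame and a filter() step added purely for illustration:

```r
library(dplyr)

# Hypothetical toy data, not the asker's df
toy <- data.frame(
  species = c("a", "a", "b", "c"),
  num     = c(1L, 3L, 4L, 2L)
)

# Denominator taken from the piped data: only rows that survive filter()
via_dot <- toy %>%
  filter(species != "c") %>%
  count(species, wt = num / sum(.$num), name = "percent")

# Denominator taken from the original object: still includes species "c"
via_name <- toy %>%
  filter(species != "c") %>%
  count(species, wt = num / sum(toy$num), name = "percent")

sum(via_dot$percent)   # shares of the filtered data add up to 1
sum(via_name$percent)  # shares fall short of 1 because "c" is still counted
```

Without an intermediate step such as filter(), the two forms give the same result, which is why the df$num version in the question still produced the expected numbers.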
Here are two alternative approaches:
Using map_vec:
library(purrr)
library(dplyr)
df %>%
summarise(sum_num = sum(num), .by=species) %>%
mutate(percent = map_vec(sum_num, ~ .x / sum(df$num)))
# credits to @r2evans:
aggregate(num ~ species, data = df, sum) |>
transform(percent = num/sum(num))
# or:
df_sums <- aggregate(num ~ species, data = df, sum)
df_sums$percent <- df_sums$num / sum(df$num)
df_sums
species sum_num percent
1 Farfantepenaeus notialis 0 0.0000000
2 Farfantepenaeus spp 11 0.7333333
3 Farfantepenaeus duorarum 4 0.2666667