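I have a data frame with about 60 million observations. What is the most efficient way to count the number of observations by group in R? I tried group_by() %>% summarise(n = n()) and count(). Both took too long on my Windows 10 PC (i9-9900k, 64 GB). I would appreciate any tips. Thank you.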
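Perhaps data.table would be somewhat more efficient.
Edit: benchmark
Edit #2: extended benchmark; using a data.frame rather than a data.table as the starting point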
library(microbenchmark)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
set.seed(123)
df <- data.frame(group = sample(LETTERS[1:26], 1e7, replace = TRUE),
                 row1 = rnorm(1e7), stringsAsFactors = FALSE)
ti <- tibble(df)     # for dplyr
DT <- data.table(df) # for data.table
microbenchmark(data.table = DT[, .N, by = group],
               dplyr = ti %>% group_by(group) %>% tally(),
               tabulate = tabulate(factor(df$group)),
               table = table(df$group),
               times = 10L)
#> Unit: milliseconds
#>        expr       min        lq     mean    median       uq      max neval cld
#>  data.table  70.89869  83.70017 102.9951  85.22042 129.7559 165.3728    10 a
#>       dplyr 163.26062 166.69943 178.0351 171.58726 173.7652 239.7959    10  b
#>    tabulate 278.72801 289.18787 296.3020 294.53976 301.6141 323.6547    10   c
#>       table 466.70126 499.04382 518.5858 509.88502 517.6628 586.3363    10    d
Created on 2020-07-18 by the reprex package (v0.3.0)
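For a data frame that already sits in memory, the data.table approach might look roughly like the sketch below. The small `df` here is only a hypothetical stand-in for the 60-million-row data, and the column names `group` and `value` are assumptions. `setDT()` converts the data frame by reference, which should avoid copying a large object, and `keyby` returns the counts sorted by group.

```r
library(data.table)

# Hypothetical small stand-in for the real 60-million-row data frame
set.seed(1)
df <- data.frame(group = sample(LETTERS, 1e5, replace = TRUE),
                 value = rnorm(1e5),
                 stringsAsFactors = FALSE)

# Convert to a data.table in place (by reference, no copy of the data)
setDT(df)

# .N is the number of rows in each group; keyby also sorts the result by group
counts <- df[, .N, keyby = group]
print(head(counts))
```

If the sorted order is not needed, `by = group` (as in the benchmark) skips the sort and returns groups in order of first appearance.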