如何调整数据表操作,以便除了sum几个列的类别之外,它还会同时计算其他函数,例如count mean和(.N)并自动创建列名:"sum c1","sum c2"," sum c4","表示c1","表示c2","表示c4",最好也是1列"计数"?
我的旧解决方案是写出来
mean col1 = ....
mean col2 = ....
Run Code Online (Sandbox Code Playgroud)
等等,在data.table命令中
哪个有效,但我觉得非常低效,如果在新的应用程序版本中,计算依赖于用户在R Shiny应用程序中的选择来计算哪些列,它将不再用于预编码.
我已经通过一堆帖子和博客文章阅读了我的方式,但还没有弄清楚如何最好地做到这一点.我读到,在某些情况下,根据您使用的方法(.sdcols,get,lapply和/或=),大型数据表上的操作会变得非常慢.因此我添加了一个"相当大"的虚拟数据集
我的实际数据大约是100k行乘100列和1-100组.
library(data.table)
n = 100000
dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
# add more columns to test for big data tables
lapply(c(paste('c', 5:100, sep ='')),
function(addcol) dt[[addcol]] <<- rnorm(n,1000) )
# Simulate columns selected by shiny app user
Colchoice <- c("c1", "c4")
FunChoice <- c(".N", "mean", "sum")
# attempt which now does just one function and doesn't add names
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=Colchoice ]
Run Code Online (Sandbox Code Playgroud)
预期输出是每组一行,每个选定列的每个函数列.
Category Mean c1 Sum c1 Mean c4 ...
A
B
C
D
E
......
Run Code Online (Sandbox Code Playgroud)
可能是重复但我没有找到我需要的确切答案
如果我理解正确,这个问题由两部分组成:
对于第 1 部分,这几乎是将多个函数应用于 data.table 中的多个列的重复,但附加要求结果应使用by =.
因此,必须通过在调用中添加参数来修改eddi 的答案:recursive = FALSEunlist()
my.summary = function(x) list(N = length(x), mean = mean(x), median = median(x))
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)category c1.N c1.mean c1.median c4.N c4.mean c4.median 1: f 3974 9999.987 9999.989 3974 9.994220 9.974125 2: w 4033 10000.008 9999.991 4033 10.004261 9.986771 3: n 4025 9999.981 10000.000 4025 10.003686 9.998259 4: x 3975 10000.035 10000.019 3975 10.010448 9.995268 5: k 3957 10000.019 10000.017 3957 9.991886 10.007873 6: j 4027 10000.026 10000.023 4027 10.015663 9.998103 ...
对于第 2 部分,我们需要my.summary()从函数名称的字符向量中创建。这可以通过“编程语言”来实现,即将表达式组装为字符串,最后对其进行解析和评估:
my.summary <-
sapply(FunChoice, function(f) paste0(f, "(x)")) %>%
paste(collapse = ", ") %>%
sprintf("function(x) setNames(list(%s), FunChoice)", .) %>%
parse(text = .) %>%
eval()
my.summary
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)function(x) setNames(list(length(x), mean(x), sum(x)), FunChoice) <environment: 0xe376640>
或者,我们可以循环遍历类别和rbind()结果:
library(magrittr) # used only to improve readability
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
Run Code Online (Sandbox Code Playgroud)
到目前为止,已经发布了4个data.table和1个dplyr解决方案。至少有一个答案声称是“超快”。因此,我想通过具有不同行数的基准来验证:
library(data.table)
library(magrittr)
bm <- bench::press(
n = 10L^(2:6),
{
set.seed(12212018)
dt <- data.table(
index = 1:n,
category = sample(letters[1:25], n, replace = T),
c1 = rnorm(n, 10000),
c2 = rnorm(n, 1000),
c3 = rnorm(n, 100),
c4 = rnorm(n, 10)
)
# use set() instead of <<- for appending additional columns
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n, 1000))
tables()
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))
bench::mark(
unlist = {
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
},
loop_category = {
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
},
dcast = {
dcast(dt, category ~ 1, fun = list(length, mean, sum), value.var = ColChoice)
},
loop_col = {
lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
) %>%
Reduce(function(x, y) merge(x, y, by="category"), .)
},
dplyr = {
dt %>%
dplyr::group_by(category) %>%
dplyr::summarise_at(dplyr::vars(ColChoice), .funs = setNames(FunChoice, FunChoice))
},
check = function(x, y)
all.equal(setDT(x)[order(category)],
setDT(y)[order(category)] %>%
setnames(stringr::str_replace(names(.), "_", ".")),
ignore.col.order = TRUE,
check.attributes = FALSE
)
)
}
)
Run Code Online (Sandbox Code Playgroud)
绘制时更容易比较结果:
library(ggplot2)
autoplot(bm)
Run Code Online (Sandbox Code Playgroud)
请注意对数时间刻度。
对于这个测试用例,unlist方法总是最快的方法,其次是dcast。dplyr正在追赶更大的问题规模n。lapply/loop两种方法的性能都较低。特别是,Parfait循环遍历列并随后合并子结果的方法似乎对问题大小相当敏感n。
正如jangorecki所建议的,我用更多的行和不同数量的组重复了基准测试。由于内存限制,最大的问题大小是 10 M 行乘以 102 列,占用 7.7 GB 内存。
因此,基准代码的第一部分修改为
bm <- bench::press(
n_grp = 10^(1:3),
n_row = 10L^seq(3, 7, by = 2),
{
set.seed(12212018)
dt <- data.table(
index = 1:n_row,
category = sample(n_grp, n_row, replace = TRUE),
c1 = rnorm(n_row),
c2 = rnorm(n_row),
c3 = rnorm(n_row),
c4 = rnorm(n_row, 10)
)
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n_row, 1000))
tables()
...
Run Code Online (Sandbox Code Playgroud)
正如jangorecki预期的那样,一些解决方案对组的数量比其他解决方案更敏感。特别是,loop_category 的性能随着组数的增加而下降得更厉害,而dcast似乎受到的影响较小。对于较少的组,unlist方法总是比dcast快,而对于许多组dcast则更快。但是,对于较大的问题,unlist似乎领先于dcast。
受此后续问题的启发,我添加了对语言方法的计算,其中将整个表达式创建为字符串、解析和评估。
该表达式是由
library(magrittr)
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.expression <- CJ(ColChoice, FunChoice, sorted = FALSE)[
, sprintf("%s.%s = %s(%s)", V1, V2, V2, V1)] %>%
paste(collapse = ", ") %>%
sprintf("dt[, .(%s), by = category]", .) %>%
parse(text = .)
my.expression
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)expression(dt[, .(c1.length = length(c1), c1.mean = mean(c1), c1.sum = sum(c1), c4.length = length(c4), c4.mean = mean(c4), c4.sum = sum(c4)), by = category])
然后通过以下方式评估
eval(my.expression)
Run Code Online (Sandbox Code Playgroud)
这产生
Run Code Online (Sandbox Code Playgroud)category c1.length c1.mean c1.sum c4.length c4.mean c4.sum 1: f 3974 9999.987 39739947 3974 9.994220 39717.03 2: w 4033 10000.008 40330032 4033 10.004261 40347.19 3: n 4025 9999.981 40249924 4025 10.003686 40264.84 4: x 3975 10000.035 39750141 3975 10.010448 39791.53 5: k 3957 10000.019 39570074 3957 9.991886 39537.89 6: j 4027 10000.026 40270106 4027 10.015663 40333.07 ...
我修改了第二个基准测试的代码以包含这种方法,但不得不将额外的列从 100 减少到 25,以应对更小的 PC 的内存限制。图表显示“eval”方法几乎总是最快或第二:
这是一个data.table答案:
funs_list <- lapply(FunChoice, as.symbol)
dcast(dt, category~1, fun=eval(funs_list), value.var = Colchoice)
Run Code Online (Sandbox Code Playgroud)
它超级快,可以满足您的需求.