对 R 中需要分组的函数进行向量化

coo*_*and 0 r vectorization dplyr

假设我有一些数据,其中有多个数据点,其中一些数据点共享一个组标识符:

group <- rep(c(1:5), times=3)
cost <- rnorm(length(group), 100, 5)
current_score <- rnorm(length(group), 7, 2)
future_score <- current_score*runif(1)

dat <- data.frame(group, cost, current_score, future_score)
Run Code Online (Sandbox Code Playgroud)

以及给出总体加权组得分的函数:

wt_score <- function(group, dat)
{
  one_group_dat <- dat[dat$group == group, ]
  wt_score <- sum(one_group_dat$cost * (one_group_dat$current_score - one_group_dat$future_score))/sum(one_group_dat$cost)
  return(wt_score)
}
Run Code Online (Sandbox Code Playgroud)

有没有办法对上述函数进行矢量化,以便我不必使用如下所示的循环?问题在于,在实践中,一个函数应用于数万个组和数百万个数据点,因此循环非常慢。

# THIS IS TOO SLOW!
dat$wt_score <- 0
for(i in 1:nrow(dat))
{
  dat$wt_score[i] <- wt_score(dat$group[i], dat)
}
Run Code Online (Sandbox Code Playgroud)

r2e*_*ans 5

你不需要有一个函数:

\n
library(dplyr) # dplyr_1.1.0\ndat %>%\n  mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group)\n#    group      cost current_score future_score wt_score wt_score2\n# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383\n# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164\n# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275\n# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018\n# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793\n# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383\n# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164\n# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275\n# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018\n# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793\n# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383\n# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164\n# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275\n# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018\n# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793\n
Run Code Online (Sandbox Code Playgroud)\n

wt_score来自你的for循环,wt_score2是相同的)。

\n

(如果您使用的是dplyr1.1 之前的版本,那么您不能使用.by=... 删除它,而是使用 dat %>% group_by(group) %>% mutate(...)。)

\n

当您标记时,这也可以在基础 R 中完成。

\n
dat$wt_score3 <- ave(\n  1:nrow(dat), dat$group,\n  FUN = function(z) sum(dat$cost[z] * (dat$current_score[z] - dat$future_score[z]))/sum(dat$cost[z]))\ndat\n#    group      cost current_score future_score wt_score wt_score3\n# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383\n# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164\n# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275\n# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018\n# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793\n# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383\n# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164\n# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275\n# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018\n# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793\n# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383\n# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164\n# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275\n# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018\n# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793\n
Run Code Online (Sandbox Code Playgroud)\n

stats::ave不接受多个参数,因此我们向其提供行索引,并处理从帧内部提取到匿名FUN操作的过程。

\n
\n

用小框架进行基准测试还为时过早,不应该被信任。例如,bench::mark在 15 行上的 3 个上使用,我们看到

\n
# A tibble: 3 \xc3\x97 13\n  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time               gc                  \n  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>             <list>              \n1 ave         48.27\xc2\xb5s  56.18\xc2\xb5s    17541.        0B     8.75  8020     4      457ms <NULL> <Rprofmem [0 \xc3\x97 3]>  <bench_tm [8,024]> <tibble [8,024 \xc3\x97 3]>\n2 dplyr        1.06ms   1.13ms      882.    4.85KB     5.35   330     2      374ms <NULL> <Rprofmem [16 \xc3\x97 3]> <bench_tm [332]>   <tibble [332 \xc3\x97 3]>  \n3 forloop    524.06\xc2\xb5s 621.33\xc2\xb5s     1622.        0B     8.71   745     4      459ms <NULL> <Rprofmem [0 \xc3\x97 3]>  <bench_tm [749]>   <tibble [749 \xc3\x97 3]>  \n
Run Code Online (Sandbox Code Playgroud)\n

其中`itr/sec`(越高越好)和median(越低越好)是两个很好的衡量标准。然而,如果我们使用更大的框架,

\n
datbig <- bind_rows(replicate(1000, dat, simplify=FALSE))\nnrow(datbig)\n# [1] 15000\n\nbench::mark(...)\n# A tibble: 3 \xc3\x97 13\n  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                \n  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            \n1 ave        764.12\xc2\xb5s 833.87\xc2\xb5s   971.       1.39MB     9.85   493     5   507.87ms <NULL> <Rprofmem [42 \xc3\x97 3]>      <bench_tm [493]> <tibble [493 \xc3\x97 3]>\n2 dplyr        1.34ms   1.41ms   666.    1013.19KB     4.00   333     2    500.2ms <NULL> <Rprofmem [49 \xc3\x97 3]>      <bench_tm [333]> <tibble [333 \xc3\x97 3]>\n3 forloop       3.58s    3.58s     0.280   12.38GB    21.5      1    77      3.58s <NULL> <Rprofmem [375,000 \xc3\x97 3]> <bench_tm [1]>   <tibble [1 \xc3\x97 3]>  \n
Run Code Online (Sandbox Code Playgroud)\n

我们可以看到不同的收获。另一个(大!)因素是小组的数量,我怀疑更多的小组会改变结果。(for这里的循环非常慢......)

\n
\n

另一种选择:迭代唯一组而不是每一行。

\n
for (grp in unique(dat$group)) {\n  dat$wt_score3[dat$group == grp] <- wt_score(grp, dat)\n}\ndat\n#    group      cost current_score future_score wt_score wt_score3\n# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383\n# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164\n# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275\n# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018\n# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793\n# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383\n# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164\n# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275\n# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018\n# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793\n# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383\n# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164\n# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275\n# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018\n# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793\n
Run Code Online (Sandbox Code Playgroud)\n

我们不迭代nrow(dat)次数,而只是迭代length(unique(dat$group))次数。这比第一个循环要快得多for。这是比较(使用datbig)。

\n
# A tibble: 4 \xc3\x97 13\n  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                \n  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            \n1 ave        771.51\xc2\xb5s 857.27\xc2\xb5s   935.       1.39MB     9.76   479     5   512.55ms <NULL> <Rprofmem [42 \xc3\x97 3]>      <bench_tm [479]> <tibble [479 \xc3\x97 3]>\n2 dplyr        1.39ms   1.51ms   617.    1013.19KB     4.00   309     2   500.44ms <NULL> <Rprofmem [49 \xc3\x97 3]>      <bench_tm [309]> <tibble [309 \xc3\x97 3]>\n3 forloop       3.83s    3.83s     0.261   12.38GB    21.9      1    84      3.83s <NULL> <Rprofmem [375,000 \xc3\x97 3]> <bench_tm [1]>   <tibble [1 \xc3\x97 3]>  \n4 forloop2   932.03\xc2\xb5s 974.12\xc2\xb5s   758.       5.04MB    22.0    379    11   500.28ms <NULL> <Rprofmem [142 \xc3\x97 3]>     <bench_tm [379]> <tibble [379 \xc3\x97 3]>\n
Run Code Online (Sandbox Code Playgroud)\n

表明减少的循环与和for相当。avedplyr

\n

(我也忍不住看到了差异mem_alloc。12GB 与 1-5MB 相比?是的,这说明了为什么每行循环不是最明智的选择的另一个很好的理由for。)

\n

如果您想从中获得更多速度,还有另一种选择:

\n
library(data.table)\nDT <- as.data.table(datbig)\nDT[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]\n\n# A tibble: 5 \xc3\x97 13\n  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                \n  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            \n1 ave        754.55\xc2\xb5s 837.44\xc2\xb5s  1020.       1.39MB    10.0    510     5    500.1ms <NULL> <Rprofmem [42 \xc3\x97 3]>      <bench_tm [510]> <tibble [510 \xc3\x97 3]>\n2 dplyr        1.32ms   1.38ms   679.    1013.19KB     4.00   340     2   500.41ms <NULL> <Rprofmem [49 \xc3\x97 3]>      <bench_tm [340]> <tibble [340 \xc3\x97 3]>\n3 forloop       4.11s    4.11s     0.243   12.38GB    22.4      1    92      4.11s <NULL> <Rprofmem [375,007 \xc3\x97 3]> <bench_tm [1]>   <tibble [1 \xc3\x97 3]>  \n4 forloop2   893.94\xc2\xb5s 958.22\xc2\xb5s   785.       5.04MB    22.0    393    11   500.33ms <NULL> <Rprofmem [142 \xc3\x97 3]>     <bench_tm [393]> <tibble [393 \xc3\x97 3]>\n5 datatable  442.91\xc2\xb5s 481.71\xc2\xb5s  1931.     326.06KB     4.00   966     2   500.22ms <NULL> <Rprofmem [17 \xc3\x97 3]>      <bench_tm [966]> <tibble [966 \xc3\x97 3]>\n
Run Code Online (Sandbox Code Playgroud)\n

但我不建议虚假地切换到data.table:它速度很快并且工作得很好,但与基础 R 和 dplyr 相比,它有一个非常不同的“方言”。在继续做另一件事之前,先学会把其中一件事情做好。只有当您的数据非常庞大(主观和相对)时,我才会真正敦促您进行此转换。

\n

mem_alloc(我真的很喜欢DT 实现的显着减少......)

\n