Suppose I have some data consisting of multiple data points, some of which share a group identifier:
group <- rep(c(1:5), times=3)
cost <- rnorm(length(group), 100, 5)
current_score <- rnorm(length(group), 7, 2)
future_score <- current_score*runif(1)
dat <- data.frame(group, cost, current_score, future_score)
and a function that returns the overall weighted score for a group:
# Weighted score for one group: cost-weighted mean of (current_score - future_score)
# over the rows belonging to that group.
wt_score <- function(group, dat)
{
  one_group_dat <- dat[dat$group == group, ]
  wt_score <- sum(one_group_dat$cost * (one_group_dat$current_score - one_group_dat$future_score)) / sum(one_group_dat$cost)
  return(wt_score)
}
Is there a way to vectorize the function above so I don't have to use a loop like the one below? The problem is that in practice the function is applied to tens of thousands of groups and millions of data points, so the loop is painfully slow.
# THIS IS TOO SLOW!
dat$wt_score <- 0
for(i in 1:nrow(dat))
{
  dat$wt_score[i] <- wt_score(dat$group[i], dat)
}
You don't need a function for this at all:
library(dplyr) # dplyr_1.1.0
dat %>%
  mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group)
#    group      cost current_score future_score wt_score wt_score2
# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
(wt_score comes from your for loop; wt_score2 is the same.)
(If you are on a dplyr version older than 1.1, you cannot use .by=; drop it and use dat %>% group_by(group) %>% mutate(...) instead, as sketched below.)
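A minimal sketch of that pre-1.1 form, using the same columns as above (group_by() plus ungroup() in place of .by=):

library(dplyr)
# dplyr < 1.1: group, compute the weighted score per group, then drop the grouping
dat %>%
  group_by(group) %>%
  mutate(wt_score2 = sum(cost * (current_score - future_score)) / sum(cost)) %>%
  ungroup()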
While you tagged dplyr, this can also be done in base R.
dat$wt_score3 <- ave(
  1:nrow(dat), dat$group,
  FUN = function(z) sum(dat$cost[z] * (dat$current_score[z] - dat$future_score[z]))/sum(dat$cost[z]))
dat
#    group      cost current_score future_score wt_score wt_score3
# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
stats::ave does not accept multiple vector arguments, so we pass it the row indices and do the extraction from the frame inside the anonymous FUN ourselves.
Benchmarking on a small frame is premature and should not be trusted. For example, running bench::mark on the three approaches over these 15 rows, we see
# A tibble: 3 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time               gc
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>             <list>
# 1 ave         48.27µs  56.18µs    17541.        0B     8.75  8020     4      457ms <NULL> <Rprofmem [0 × 3]>   <bench_tm [8,024]> <tibble [8,024 × 3]>
# 2 dplyr        1.06ms   1.13ms      882.    4.85KB     5.35   330     2      374ms <NULL> <Rprofmem [16 × 3]>  <bench_tm [332]>   <tibble [332 × 3]>
# 3 forloop    524.06µs 621.33µs     1622.        0B     8.71   745     4      459ms <NULL> <Rprofmem [0 × 3]>   <bench_tm [749]>   <tibble [749 × 3]>
where `itr/sec` (higher is better) and median (lower is better) are two good measures to look at. However, if we use a larger frame,
datbig <- bind_rows(replicate(1000, dat, simplify=FALSE))
nrow(datbig)
# [1] 15000

bench::mark(...)
# A tibble: 3 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>
# 1 ave        764.12µs 833.87µs      971.    1.39MB     9.85   493     5   507.87ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [493]> <tibble [493 × 3]>
# 2 dplyr        1.34ms   1.41ms      666. 1013.19KB     4.00   333     2    500.2ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [333]> <tibble [333 × 3]>
# 3 forloop       3.58s    3.58s     0.280   12.38GB    21.5      1    77      3.58s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]>   <tibble [1 × 3]>
we see a rather different picture. Another (big!) factor is the number of groups; I suspect that more groups would shift these results further. (The per-row for loop is painfully slow here...)
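The exact bench::mark() call is elided above; below is a sketch of how such a comparison might be set up. The labels ave, dplyr, and forloop mirror the expression column, but the wrappers themselves are an assumption, not the original call; check = FALSE is needed because the three approaches return differently shaped objects.

# Hypothetical benchmark harness (assumed; not the author's exact call)
bench::mark(
  ave = ave(
    1:nrow(datbig), datbig$group,
    FUN = function(z) sum(datbig$cost[z] * (datbig$current_score[z] - datbig$future_score[z])) / sum(datbig$cost[z])),
  dplyr = datbig %>%
    mutate(wt_score2 = sum(cost * (current_score - future_score)) / sum(cost), .by = group),
  forloop = {
    tmp <- datbig
    for (i in 1:nrow(tmp)) tmp$wt_score[i] <- wt_score(tmp$group[i], tmp)
    tmp
  },
  check = FALSE  # results differ in shape, so skip bench's equality check
)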
Another option: iterate over the unique groups rather than over every row.
for (grp in unique(dat$group)) {
  dat$wt_score3[dat$group == grp] <- wt_score(grp, dat)
}
dat
#    group      cost current_score future_score wt_score wt_score3
# 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
# 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
# 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
# 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
# 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
# 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
# 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
# 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
# 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
# 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
# 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
# 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
# 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
# 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
# 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
Instead of iterating nrow(dat) times, we iterate only length(unique(dat$group)) times. This is much faster than the first for loop. Here is the comparison (using datbig):
# A tibble: 4 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>
# 1 ave        771.51µs 857.27µs      935.    1.39MB     9.76   479     5   512.55ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [479]> <tibble [479 × 3]>
# 2 dplyr        1.39ms   1.51ms      617. 1013.19KB     4.00   309     2   500.44ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [309]> <tibble [309 × 3]>
# 3 forloop       3.83s    3.83s     0.261   12.38GB    21.9      1    84      3.83s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]>   <tibble [1 × 3]>
# 4 forloop2   932.03µs 974.12µs      758.    5.04MB    22.0    379    11   500.28ms <NULL> <Rprofmem [142 × 3]>     <bench_tm [379]> <tibble [379 × 3]>
which shows that the reduced for loop is on par with ave and dplyr.
(I also can't help noticing the difference in mem_alloc: 12GB versus 1-5MB? Yes, that is yet another good reason why a per-row for loop is not the wisest choice.)
If you want to squeeze even more speed out of this, there is another option:
library(data.table)
DT <- as.data.table(datbig)
DT[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]

# A tibble: 5 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>
# 1 ave        754.55µs 837.44µs     1020.    1.39MB    10.0    510     5    500.1ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [510]> <tibble [510 × 3]>
# 2 dplyr        1.32ms   1.38ms      679. 1013.19KB     4.00   340     2   500.41ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [340]> <tibble [340 × 3]>
# 3 forloop       4.11s    4.11s     0.243   12.38GB    22.4      1    92      4.11s <NULL> <Rprofmem [375,007 × 3]> <bench_tm [1]>   <tibble [1 × 3]>
# 4 forloop2   893.94µs 958.22µs      785.    5.04MB    22.0    393    11   500.33ms <NULL> <Rprofmem [142 × 3]>     <bench_tm [393]> <tibble [393 × 3]>
# 5 datatable  442.91µs 481.71µs     1931.  326.06KB     4.00   966     2   500.22ms <NULL> <Rprofmem [17 × 3]>      <bench_tm [966]> <tibble [966 × 3]>
That said, I don't recommend switching to data.table on a whim: it is fast and works very well, but its "dialect" is quite different from base R and dplyr. Learn to do one of them well before moving on to another. I would only really urge the switch when your data is truly huge (subjective and relative).
(And I really do like the significant reduction in mem_alloc with the data.table implementation...)
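If you do adopt data.table only for this one grouped step, converting back afterwards is cheap. A small sketch, assuming you want a plain data.frame again (setDF() converts a data.table back by reference):

library(data.table)
DT <- as.data.table(datbig)   # copies datbig into a data.table
DT[, wt_score := sum(cost * (current_score - future_score)) / sum(cost), by = .(group)]
setDF(DT)                     # back to a plain data.frame without another copy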