clp*_*clp 13 performance r vectorization dataframe
Consider the following data frame:
  x y z
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
-------
x 4 2 1 <--- vector to multiply by
I want to multiply each column by a separate value, for example c(4, 2, 1), giving:
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
Code:
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df <- expand.grid(x=s01, y=s01, z=s01)
df
for (d in seq_len(3)) df[,d] <- df[,d] * pw2[d]
df
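Question: find a vectorized solution without a for loop (in base R).
Note: the question "Multiply columns in a data frame by a vector" is ambiguous, because it covers two different queries. Both can easily be solved with a for loop; here, a vectorized solution is explicitly requested.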
Maë*_*aël 11
Use sweep to apply a function across a margin of the data frame:
sweep(df, 2, pw2, `*`)
Or with col:
df * pw2[col(df)]
Output:
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
For large data frames, check out collapse::TRA, which is roughly 10 times faster than any of the other answers (see the benchmark):
collapse::TRA(df, pw2, "*")
Benchmark:
bench::mark(sweep = sweep(df, 2, pw2, `*`),
            col = df * pw2[col(df)],
            '%*%' = setNames(
              as.data.frame(as.matrix(df) %*% diag(pw2)),
              names(df)
            ),
            TRA = collapse::TRA(df, pw2, "*"),
            mapply = data.frame(mapply(FUN = `*`, df, pw2)),
            apply = t(apply(df, 1, \(x) x*pw2)),
            t = t(t(df)*pw2), check = FALSE
            )

# A tibble: 7 × 13
  expression      min  median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴
  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
1 sweep       346.7µs 382.1µs   2427.   1.23KB   10.6   1141     5 470.2ms
2 col         303.1µs 330.4µs   2760.     784B    8.45  1307     4 473.5ms
3 %*%          72.8µs  77.9µs  11861.     480B   10.6   5599     5 472.1ms
4 TRA             5µs   5.5µs 167050.       0B   16.7   9999     1  59.9ms
5 mapply      117.6µs 127.9µs   7309.     480B   10.6   3442     5 470.9ms
6 apply       107.8µs 117.9µs   7887.   6.49KB   12.9   3658     6 463.8ms
7 t            55.3µs  59.7µs  15238.     720B    8.13  5620     3 368.8ms
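To see why the col() variant needs the extra indexing step, here is a minimal sketch in base R that rebuilds the example data and spells out the intermediate values. The names idx, w, res_col and res_sweep are introduced only for this illustration. Multiplying a data frame by a plain vector recycles the vector down the columns, so pw2 has to be expanded to one weight per cell first.

```r
# Rebuild the example data from the question
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)

# col(df) gives, for every cell, the index of the column it sits in
idx <- col(df)      # 8 x 3 integer matrix: 1s, then 2s, then 3s
w   <- pw2[idx]     # length-24 vector: eight 4s, eight 2s, eight 1s

# The expanded weights line up cell by cell with df (column-major order)
res_col   <- df * w
res_sweep <- sweep(df, 2, pw2, `*`)

# A bare `df * pw2` would instead recycle 4, 2, 1 down each column,
# which is not the per-column scaling asked for here.
print(res_col)
```

Both results hold the same values; sweep builds the expanded weight array internally, which is part of why it shows more overhead on such a small data frame.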
zep*_*ryl 10
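Convert df and pw2 to matrices, multiply them with the %*% matrix multiplication operator, then convert the result back to a data frame. This drops the column names, so wrap the call in setNames() to keep them.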
setNames(
as.data.frame(as.matrix(df) %*% diag(pw2)),
names(df)
)
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
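The reason the matrix product works is that right-multiplying by a diagonal matrix scales column j by the j-th diagonal entry. The following sketch (base R only; D, m and scaled are illustrative names, not part of the answer above) makes the intermediate objects visible and confirms that the column names are lost along the way.

```r
# Rebuild the example data from the question
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)

D <- diag(pw2)        # 3 x 3 matrix with 4, 2, 1 on the diagonal
m <- as.matrix(df)    # 8 x 3 numeric matrix with columns x, y, z

# Entry (i, j) of the product is m[i, j] * pw2[j], since D is zero off the diagonal
scaled <- m %*% D

# The product takes its column names from D, which has none
print(colnames(scaled))   # NULL
print(scaled)
```

One caveat: for a length-one vector, diag(pw2) builds an identity matrix of that size rather than a 1 x 1 matrix, so this approach needs extra care with a single-column data frame. The dense product also performs on the order of n * p^2 multiplications for n rows and p columns, most of them by zero.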
Using mapply():
mapply(FUN = `*`, df, pw2)
     x y z
[1,] 0 0 0
[2,] 4 0 0
[3,] 0 2 0
[4,] 4 2 0
[5,] 0 0 1
[6,] 4 0 1
[7,] 0 2 1
[8,] 4 2 1
And as a data frame:
data.frame(mapply(FUN = `*`, df, pw2))
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
Another option is to use apply together with a transpose, like this:
pw2 <- c(4, 2, 1)
t(apply(df, 1, \(x) x*pw2))
#>   x y z
#> 1 0 0 0
#> 2 4 0 0
#> 3 0 2 0
#> 4 4 2 0
#> 5 0 0 1
#> 6 4 0 1
#> 7 0 2 1
#> 8 4 2 1
Created on 2023-04-10 with reprex v2.0.2
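The outer t() is needed because apply over rows returns each row's result as a column. A small sketch, assuming only base R (raw and fixed are illustrative names, and function(x) is used in place of the \(x) shorthand so that it also runs on R versions before 4.1):

```r
# Rebuild the example data from the question
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)

# Each row of df is multiplied element-wise by pw2; apply() then binds
# the 8 results together as columns, giving a 3 x 8 matrix
raw <- apply(df, 1, function(x) x * pw2)
print(dim(raw))

# Transposing restores the original 8 x 3 orientation
fixed <- t(raw)
print(dim(fixed))
```

Note that the result is a matrix, not a data frame, so it would need as.data.frame() if a data frame is required downstream.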
Here is another option: turn the vector into a matrix with the same dimensions as the data frame, then simply multiply the two:
t(replicate(nrow(df), pw2)) * df
Output:
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
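As a sketch of what the replicate() step produces, the code below inspects the weight matrix on its own. The matrix(..., byrow = TRUE) construction is an alternative added here for comparison, not part of the answer above; w_rep and w_mat are illustrative names.

```r
# Rebuild the example data from the question
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)

# replicate() returns pw2 repeated as columns (3 x 8), hence the t()
w_rep <- t(replicate(nrow(df), pw2))

# Assumed alternative: fill an 8 x 3 matrix row by row with pw2
w_mat <- matrix(pw2, nrow = nrow(df), ncol = length(pw2), byrow = TRUE)

print(w_rep[1:2, ])   # every row is 4 2 1

# Element-wise product of the weight matrix and the data frame
res <- w_mat * df
print(res)
```

Either way, every row of the weight matrix is a copy of pw2, so the element-wise product scales each column by its own weight.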
The existing approaches in all the answers look great, but I believe we can achieve higher efficiency by using Map + list2DF instead of mapply (especially if you prefer to stay in base R).
Here is a benchmark of the mapply and Map variants:
microbenchmark(
  "mapply1" = data.frame(mapply(FUN = `*`, df, pw2)),
  "mapply2" = as.data.frame(mapply(FUN = `*`, df, pw2)),
  "Map1" = list2DF(Map(`*`, df, pw2)),
  "Map2" = list2DF(Map(`*`, df, as.list(pw2)))
)
which gives
Unit: microseconds
    expr  min    lq    mean median     uq   max neval
 mapply1 74.6 78.60 112.163  97.05 140.50 342.6   100
 mapply2 34.6 38.20  55.513  42.70  67.40 313.5   100
    Map1 23.8 25.25  33.728  27.60  41.30 113.8   100
    Map2 25.9 28.75  40.866  32.95  47.65 238.6   100
Also, let's add the Map approach to the benchmark provided by @Maël, e.g.,
bc <- bench::mark(
  sweep = sweep(df, 2, pw2, `*`),
  col = df * pw2[col(df)],
  "%*%" = setNames(
    as.data.frame(as.matrix(df) %*% diag(pw2)),
    names(df)
  ),
  TRA = collapse::TRA(df, pw2, "*"),
  mapply1 = data.frame(mapply(FUN = `*`, df, pw2)),
  mapply2 = as.data.frame(mapply(FUN = `*`, df, pw2)),
  Map1 = list2DF(Map(`*`, df, pw2)),
  Map2 = list2DF(Map(`*`, df, as.list(pw2))),
  apply = t(apply(df, 1, \(x) x * pw2)),
  t = t(t(df) * pw2),
  check = FALSE
)
and we see that Map ranks second in terms of efficiency, behind collapse::TRA:
# A tibble: 10 × 13
   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
 1 sweep       201.7µs  249.2µs     3526.  101.24KB     12.6  1680     6
 2 col         174.9µs  225.6µs     3637.    9.02KB     10.4  1748     5
 3 %*%          45.4µs   52.9µs    17026.   36.95KB     12.5  8158     6
 4 TRA           3.4µs    3.8µs   226089.  905.09KB     22.6  9999     1
 5 mapply1      71.6µs   78.4µs    11958.      480B     14.7  5681     7
 6 mapply2      33.1µs   37.4µs    25339.      480B     17.7  9993     7
 7 Map1         22.5µs   26.1µs    35649.        0B     17.8  9995     5
 8 Map2         25.3µs   29.4µs    31785.        0B     19.1  9994     6
 9 apply        70.2µs   80.7µs    11684.   11.91KB     14.7  5562     7
10 t            34.8µs   40.2µs    23608.    3.77KB     14.2  9994     6
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
and autoplot(bc) visualizes these timings.
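Because the benchmark runs with check = FALSE, it does not verify that the variants agree. A minimal base R check along those lines might look like this (res_map and res_sweep are illustrative names; list2DF requires R 4.0.0 or later):

```r
# Rebuild the example data from the question
pw2 <- c(4, 2, 1)
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)

# Map pairs column i of df with element i of pw2 and returns a named list;
# list2DF turns that list of equal-length vectors into a data frame
res_map   <- list2DF(Map(`*`, df, pw2))
res_sweep <- sweep(df, 2, pw2, `*`)

print(names(res_map))              # column names carried over from df
print(all(res_map == res_sweep))   # same values as the sweep() answer
```

Unlike mapply, Map never simplifies its result to a matrix, so each column stays a separate vector.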