SHR*_*ram 7 loops r large-data
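I have a very large dataset with dimensions 60K x 4K. Within each row, I am trying to sum every four consecutive column values. Below is a smaller example dataset.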
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    1    1    1    0    1    1    0    1     1     0     0
[2,]    1    0    0    1    0    1    1    0    1     0     0     0
[3,]    0    1    1    0    0    1    1    1    0     0     0     0
[4,]    1    1    0    1    1    1    1    1    0     0     0     0
This is what I want to do:
mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3
That is, add up every four values and output the sum.
mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2
Continue to the end of the matrix (column 12 here).
mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]
Once the first row is done, apply the same to the second row, like so:
mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]
The result will be an nrow x (ncol / 4) matrix.
The expected result looks like this:
     col1-col4 col5-8 col9-12
row1         3      2       2
row2         2      2       1
row3         2      3       0
row4         3      4       0
The same goes for row 3 and onward, through the last row of the matrix. How can I loop over this efficiently?
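While Matthew's answer is really cool (+1, by the way), you can get a much faster (~100x) solution if you avoid apply and use the *Sums functions (colSums in this case), along with a bit of vector manipulation trickery: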
funSums <- function(mat) {
  t.mat <- t(mat)                                # rows become columns
  dim(t.mat) <- c(4, length(t.mat) / 4)          # wrap columns every four items (this is what we want to sum)
  t(matrix(colSums(t.mat), nrow=ncol(mat) / 4))  # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)
This produces the desired output:
     [,1] [,2] [,3]
[1,]    3    2    2
[2,]    2    2    1
[3,]    2    3    0
[4,]    3    4    0
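To see why the dim trick works, it may help to trace the intermediate objects on a tiny matrix. The following is only a sketch: the 2 x 8 matrix `small` and the variable names are made up for illustration and are not part of the question's data.

```r
# Illustrative 2 x 8 matrix: row 1 is 1, 3, ..., 15 and row 2 is 2, 4, ..., 16
small <- matrix(1:16, nrow = 2)

# Transposing puts row 1 of `small` first in memory (R stores column-major)
t.small <- t(small)                         # 8 x 2

# Re-wrap the same values four at a time: each column is now one group of four
dim(t.small) <- c(4, length(t.small) / 4)   # 4 x 4
# columns: (1,3,5,7), (9,11,13,15), (2,4,6,8), (10,12,14,16)

# One sum per group, in the order row1-group1, row1-group2, row2-group1, ...
group.sums <- colSums(t.small)

# Fold the sums back into one row per original row
res <- t(matrix(group.sums, nrow = ncol(small) / 4))
res
```

Row 1 of `res` holds the two group sums for row 1 of `small`, and row 2 holds those for row 2.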
Now, let's make something realistically sized and compare against the other options:
set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)
funApply <- function(mat) {   # Matthew's Solution
  apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
library(RcppRoll)             # provides roll_sum
funRcpp <- function(mat) {    # David's Solution
  roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1)%%4 == 1]
}
library(microbenchmark)
microbenchmark(times=10,
  funSums(mat),
  funApply(mat),
  funRcpp(mat)
)
Which produces:
Unit: milliseconds
          expr        min         lq     median       uq       max neval
  funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
 funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
  funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10
And to check:
all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE
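The key point is that the *Sums functions are fully "vectorized", in that all of the calculation happens in C. apply still has to do a bunch of work in R that is not strictly vectorized (in the primitive C function sense), so it is slower (but far more flexible).
Specific to this problem, it is probably possible to make it 2-3x faster, since about half the time is spent on the transposes, which are only needed so that the dim change lays the data out the way colSums needs it.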
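One possible way to skip the transposes, sketched here under assumptions rather than benchmarked: add the k strided column slices directly. `funStride` and its `k` parameter are hypothetical names introduced for this illustration, and whether this actually beats funSums on a given machine would need measuring.

```r
# Hypothetical helper: sum every k consecutive columns without transposing.
# Slice 1 holds columns 1, 1+k, 1+2k, ...; slice 2 holds columns 2, 2+k, ...; and so on.
funStride <- function(mat, k = 4) {
  stopifnot(ncol(mat) %% k == 0)               # groups must divide the columns evenly
  starts <- seq(1, ncol(mat), by = k)          # first column of each group
  out <- mat[, starts, drop = FALSE]
  for (offset in seq_len(k - 1)) {             # only k - 1 vectorized additions
    out <- out + mat[, starts + offset, drop = FALSE]
  }
  out
}

# The example matrix from the question, typed in row by row
mat <- matrix(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
                1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
                0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
                1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0),
              nrow = 4, byrow = TRUE)
funStride(mat)
```

The loop runs over the group width (4 here), not over rows or columns, so the amount of R-level work stays small regardless of the matrix size.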
Dividing the matrix into a 3D array is one way:
apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
# [,1] [,2] [,3]
# [1,] 3 2 2
# [2,] 2 2 1
# [3,] 2 3 0
# [4,] 3 4 0
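The dimensions above are hard-coded for the 4 x 12 example. A sketch of how the same idea might be generalized, assuming the number of columns is a multiple of the group size: `funArray` and `k` are hypothetical names, and swapping apply for colSums over a permuted array is a variation of mine, not part of the answer above.

```r
# Hypothetical generalization of the 3D-array approach
funArray <- function(mat, k = 4) {
  stopifnot(ncol(mat) %% k == 0)
  # Dimensions are [row, position within group, group]
  arr <- array(mat, dim = c(nrow(mat), k, ncol(mat) / k))
  # Move "position within group" to the front, then sum over it
  colSums(aperm(arr, c(2, 1, 3)))              # nrow x (ncol / k)
}

# Illustrative 3 x 8 matrix: row i is i, i + 3, ..., i + 21
m <- matrix(1:24, nrow = 3)
funArray(m)
```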
Here is another approach using the RcppRoll package:
library(RcppRoll) # Uses C++/Rcpp
n <- 4 # The summing range
roll_sum(mat, n, by.column = F)[, seq_len(ncol(mat) - n + 1) %% n == 1]
##      [,1] [,2] [,3]
## [1,]    3    2    2
## [2,]    2    2    1
## [3,]    2    3    0
## [4,]    3    4    0
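If installing a package is not an option, base R's rowsum can compute grouped sums directly. This is a sketch under the assumption that the column count is a multiple of the group size; `funRowsum` and `k` are hypothetical names, and I have not timed it against the approaches above.

```r
# Hypothetical base-R alternative: label each column with its group, then sum by group
funRowsum <- function(mat, k = 4) {
  stopifnot(ncol(mat) %% k == 0)
  group <- rep(seq_len(ncol(mat) / k), each = k)   # 1,1,1,1,2,2,2,2,...
  # rowsum() groups rows, so transpose in, sum, and transpose back out
  unname(t(rowsum(t(mat), group)))
}

# Illustrative 2 x 8 matrix: row 1 is 1, 3, ..., 15 and row 2 is 2, 4, ..., 16
m2 <- matrix(1:16, nrow = 2)
funRowsum(m2)
```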