当尝试cor()在稀疏矩阵(dgCMatrix或dgTMatrix类型)上运行该函数时,我收到以下错误:
Error in cor(x) : supply both 'x' and 'y' or a matrix-like 'x'
Run Code Online (Sandbox Code Playgroud)
将我的矩阵转换为密集将是非常低效的.有没有一种简单的方法来计算这种相关性(没有所有对循环?).
谢谢,
Jor*_*eys 14
EDITED ANSWER - 针对内存使用和速度进行了优化.
您的错误是逻辑错误,因为稀疏矩阵不被cor函数识别为矩阵,并且-yet-没有方法可以在Matrix包中进行相关.
没有我知道的功能可以让你计算这个,但你可以使用Matrix包中提供的矩阵运算符轻松地自己计算:
sparse.cor <- function(x){
n <- nrow(x)
m <- ncol(x)
ii <- unique(x@i)+1 # rows with a non-zero element
Ex <- colMeans(x)
nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii)) # colmeans
covmat <- ( crossprod(matrix(nozero,ncol=m)) +
crossprod(t(Ex))*(n-length(ii))
)/(n-1)
sdvec <- sqrt(diag(covmat))
covmat/crossprod(t(sdvec))
}
Run Code Online (Sandbox Code Playgroud)
这covmat是你的方差 - 协方差矩阵,所以你也可以计算那个.该计算基于选择至少一个元素非零的行.对于这个的交叉乘积,您将colmeans乘以全零行的数量.这相当于
换算(X - E [X])次(X - E [X])
除以n-1,你就得到方差 - 协方差矩阵.其余的很容易.
一个测试用例:
X <- sample(0:10,1e8,replace=T,p=c(0.99,rep(0.001,10)))
xx <- Matrix(X,ncol=5)
> system.time(out1 <- sparse.cor(xx))
user system elapsed
0.50 0.09 0.59
> system.time(out2 <- cor(as.matrix(xx)))
user system elapsed
1.75 0.28 2.05
> all.equal(out1,out2)
[1] TRUE
Run Code Online (Sandbox Code Playgroud)
Ron*_*Ron 13
这就是我最终使用的.感谢@Joris提供的所有帮助!
我的x矩阵非常大.假设它的大小n * p,n=200k而p=10k在我的情况.
诀窍是保持操作的稀疏性并在p * p矩阵上执行计算而不是p*n x n*p.
版本1更简单,但在时间和内存上效率较低,因为外部产品操作很昂贵:
sparse.cor2 <- function(x){
n <- nrow(x)
covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
+n*colMeans(x)%o%colMeans(x))/(n-1)
sdvec <- sqrt(diag(covmat)) # standard deviations of columns
covmat/crossprod(t(sdvec)) # correlation matrix
}
Run Code Online (Sandbox Code Playgroud)
版本2在时间上(保存多个操作)和内存更有效.p=10k矩阵仍需要大量内存:
sparse.cor3 <- function(x){
memory.limit(size=10000)
n <- nrow(x)
cMeans <- colMeans(x)
cSums <- colSums(x)
# Calculate the population covariance matrix.
# There's no need to divide by (n-1) as the std. dev is also calculated the same way.
# The code is optimized to minize use of memory and expensive operations
covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
crossp <- as.matrix(crossprod(x))
covmat <- covmat+crossp
sdvec <- sqrt(diag(covmat)) # standard deviations of columns
covmat/crossprod(t(sdvec)) # correlation matrix
}
Run Code Online (Sandbox Code Playgroud)
时序比较(sparse.cor是@Joris最新版本):
> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
>
> object.size(x)
11999472 bytes
>
> system.time(corx <- sparse.cor(x))
user system elapsed
0.50 0.06 0.56
> system.time(corx2 <- sparse.cor2(x))
user system elapsed
0.17 0.00 0.17
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
0.13 0.00 0.12
> system.time(correg <-cor(as.matrix(x)))
user system elapsed
0.25 0.03 0.29
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
Run Code Online (Sandbox Code Playgroud)
更大的x矩阵:
> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
user system elapsed
1.47 0.07 1.53
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
1.18 0.09 1.29
> system.time(corx <- sparse.cor(x))
user system elapsed
5.43 1.26 6.71
Run Code Online (Sandbox Code Playgroud)
小智 8
答案得到了@Ron的优雅解决,但对解决方案稍作修改就更清晰了,并且还返回了样本协方差矩阵.
sparse.cor4 <- function(x){
n <- nrow(x)
cMeans <- colMeans(x)
covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
sdvec <- sqrt(diag(covmat))
cormat <- covmat/tcrossprod(sdvec)
list(cov=covmat,cor=cormat)
}
Run Code Online (Sandbox Code Playgroud)
简化来自于:使用nxp矩阵X,并且列的nxp矩阵M表示X:
cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M]
M'X = X'M = M'M, which have (i,j) elements = sum(column i) * sum(column j) / n
= n * mean(column i) * mean(column j)
Run Code Online (Sandbox Code Playgroud)
或者用列的行向量m表示,
= n * m'm
Run Code Online (Sandbox Code Playgroud)
然后 cov(X) = E[X'X - n m'm]
现在它变得更快了.
> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> system.time(corx <- sparse.cor(x))
user system elapsed
1.139 0.196 1.334
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
0.194 0.007 0.201
> system.time(corx4 <- sparse.cor4(x))
user system elapsed
0.187 0.007 0.194
> system.time(correg <-cor(as.matrix(x)))
user system elapsed
0.341 0.067 0.407
> system.time(covreg <- cov(as.matrix(x)))
user system elapsed
0.314 0.016 0.330
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
[1] TRUE
Run Code Online (Sandbox Code Playgroud)