Fitting many formulae at once: a faster option than lapply?

Question

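I have a list of formulae that I want to fit to data. Rather than running a loop, I would like to fit them all at once, for performance reasons. The estimations should still be separate; I am not trying to estimate a SUR or anything like it. The following code does what I want: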

x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
                y~x[,2],
                y~x[,1] + x[,2])
lapply(formulae,lm)

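Unfortunately, this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize it?

If it helps, the only results of lm I care about are the coefficients and some standard errors.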

Answer 1

李哲源*_*李哲源 5

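As I said in my comment, what you really need is a fitting routine that is more efficient and more stable than lm(). Here I provide one that I wrote myself and have tested carefully, called lm.chol(). It takes a formula and data, and returns:

a coefficient summary table, as you normally see in summary(lm(...))$coef;
the Pearson estimate of the residual standard error, as you get from summary(lm(...))$sigma;
the adjusted R-squared, as you get from summary(lm(...))$adj.r.squared.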

## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
    }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard error
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistic and p-value
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R.squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
  }
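
Stripped of pivoting and preconditioning, the numerical core of the routine above is a Cholesky solve of the normal equations. The following is a minimal sketch of that idea only, assuming a full-rank model matrix; it uses base R alone, the variable names are illustrative, and it is not a replacement for lm.chol().

```r
## Minimal sketch: least squares through an unpivoted Cholesky factor.
## Assumes X has full column rank; no Jacobi scaling, no pivoting.
X <- model.matrix(Height ~ Girth + Volume, trees)
y <- trees$Height
n <- nrow(X); p <- ncol(X)

R <- chol(crossprod(X))                    # X'X = R'R, R upper triangular
z <- forwardsolve(t(R), crossprod(X, y))   # solve R'z = X'y
beta <- drop(backsolve(R, z))              # solve R beta = z

## residual sum of squares is y'y - z'z
sigma <- sqrt((sum(y * y) - sum(z * z)) / (n - p))

## (X'X)^{-1} = Rinv Rinv', so its diagonal is the row sums of Rinv^2
Rinv <- backsolve(R, diag(p))
se <- sqrt(rowSums(Rinv * Rinv)) * sigma

## compare with the reference fit
ref <- coef(summary(lm(Height ~ Girth + Volume, trees)))
all.equal(beta, unname(ref[, "Estimate"]))
all.equal(se, unname(ref[, "Std. Error"]))
```

The pivoting and diagonal scaling in lm.chol() are what make this approach usable when columns are badly scaled or collinear; the sketch omits both.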

Here are three examples.

Example 1: full-rank linear model

We take R's built-in dataset trees as an example.

# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  83.2958     9.0866   9.167 6.33e-10 ***
#Girth        -1.8615     1.1567  -1.609   0.1188    
#Volume        0.5756     0.2208   2.607   0.0145 *  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared:  0.4123,   Adjusted R-squared:  0.3703 
#F-statistic:  9.82 on 2 and 28 DF,  p-value: 0.0005868

## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
#              Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 83.2957705  9.0865753  9.166905 6.333488e-10
#Girth       -1.8615109  1.1566879 -1.609346 1.187591e-01
#Volume       0.5755946  0.2208225  2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869

The results are exactly the same!

Example 2: rank-deficient linear model

## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)

## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)   0.2164     0.2530   0.856    0.394
#x1           -0.1526     0.3252  -0.469    0.640
#x2           -0.3534     0.5707  -0.619    0.537
#x3                NA         NA      NA       NA

#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared:  0.0069,   Adjusted R-squared:  -0.01358 
#F-statistic: 0.337 on 2 and 97 DF,  p-value: 0.7147

## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
#              Estimate Std. Error    t value  Pr(>|t|)
#(Intercept)  0.2164455  0.2529576  0.8556595 0.3942949
#x1                  NA         NA         NA        NA
#x2          -0.2007894  0.6866871 -0.2924030 0.7706030
#x3          -0.3051760  0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!

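Here lm.chol(), which is based on Cholesky factorization with complete pivoting, and lm(), which is based on QR factorization with partial pivoting, have set different coefficients to NA. The two estimations are nevertheless equivalent: they give the same fitted values and residuals.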
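
The same effect can be reproduced with lm() alone: reordering the terms of a rank-deficient model changes which aliased coefficient is dropped, without changing the fit. The sketch below illustrates that point on the toy data from this example; it does not call lm.chol(), and the names fit_a and fit_b are illustrative.

```r
## Same toy data as Example 2: x3 is an exact linear combination of x1 and x2
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100, 3, 5))
dat$x3 <- with(dat, (x1 + x2) / 2)

## Reordering the terms changes which aliased coefficient lm() reports as NA
fit_a <- lm(y ~ x1 + x2 + x3, dat)
fit_b <- lm(y ~ x3 + x2 + x1, dat)
coef(fit_a)
coef(fit_b)

## ...but the fitted values and residuals agree
all.equal(fitted(fit_a), fitted(fit_b))
all.equal(resid(fit_a), resid(fit_b))
```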

Example 3: performance on a large linear model

n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)

## using `lm()`
system.time(lm(y ~ ., dat))
#   user  system elapsed 
#  3.212   0.096   3.315

## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
#   user  system elapsed 
#  1.024   0.028   1.056

lm.chol() is 3 to 4 times faster than lm(). If you want to know why, read my answer on that subject.
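
A rough way to compare the two numerical kernels on your own machine, independently of lm.chol(), is to time the QR-based lm.fit() against a plain Cholesky solve of the normal equations. The sketch below does that on simulated data; the problem size is arbitrary, the Cholesky step has no pivoting or scaling, and no particular timings are claimed.

```r
## Simulated full-rank design with an intercept column
set.seed(0)
n <- 2000; p <- 50
X <- cbind(1, matrix(rnorm(n * p), n, p))
y <- rnorm(n)

## QR-based least squares, as used underneath lm()
t_qr <- system.time(b_qr <- lm.fit(X, y)$coefficients)

## Cholesky solve of the normal equations (no pivoting, no preconditioner)
t_chol <- system.time({
  R <- chol(crossprod(X))
  b_chol <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))
})

## both routes give the same coefficients
all.equal(unname(b_qr), drop(b_chol))

## elapsed seconds for each route
c(qr = t_qr[["elapsed"]], chol = t_chol[["elapsed"]])
```

Increase n and p to make the difference measurable; at small sizes both calls may report zero elapsed time.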

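Remark

I have focused on improving performance at the level of the computational kernel. You can go one step further by using Ben Bolker's suggestion of parallelism. If my approach gives a 3-fold speedup, and parallel computing gives another 3-fold speedup on 4 cores, you end up with a 9-fold speedup!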
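
The parallelism suggestion itself is not reproduced here. Assuming it amounts to distributing the independent fits across cores, a sketch with mclapply() from the parallel package could look like the following. lm() stands in for lm.chol() so that the block runs on its own, and fit_one is a hypothetical helper that keeps only the estimates and standard errors.

```r
library(parallel)

## Simulated data in data-frame form, so formulae can refer to columns by name
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- with(dat, x1 + 2 * x2 + 3 * x3 + rnorm(100))

formulae <- list(y ~ x1, y ~ x2, y ~ x1 + x2)

## Hypothetical helper: fit one formula, keep estimates and standard errors
fit_one <- function(f, data) {
  coef(summary(lm(f, data = data)))[, c("Estimate", "Std. Error"), drop = FALSE]
}

## mclapply() forks, which Windows does not support, so use one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(formulae, fit_one, data = dat, mc.cores = cores)
res[[3]]
```

Swapping lm.chol() into fit_one would combine both speedups, since the coefficient table it returns already contains the same two columns.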
