使用二次多项式和直线在断点处平滑连接的分段回归

Question

使用二次多项式和直线在断点处平滑连接的分段回归

teh*_*m0n 5 regression r linear-regression piecewise lm

我想用一个断点拟合分段线性回归xt,这样x < xt我们就得到二次多项式,并且x >= xt我们有一条直线.两个部分应该顺利连接,连续性高达一阶导数xt.这是它的外观图片:

我将我的分段回归函数参数化为:

其中a,b,c和xt要被估计的参数.

我想在调整的R平方方面将该模型与整个范围内的二次多项式回归进行比较.

这是我的数据:

y <- c(1, 0.59, 0.15, 0.078, 0.02, 0.0047, 0.0019, 1, 0.56, 0.13, 
0.025, 0.0051, 0.0016, 0.00091, 1, 0.61, 0.12, 0.026, 0.0067, 
0.00085, 4e-04)

x <- c(0, 5.53, 12.92, 16.61, 20.3, 23.07, 24.92, 0, 5.53, 12.92, 
16.61, 20.3, 23.07, 24.92, 0, 5.53, 12.92, 16.61, 20.3, 23.07, 
24.92)

Run Code Online (Sandbox Code Playgroud)

我的尝试如下,一个已知的xt:

z <- pmax(0, x - xt)
x1 <- pmin(x, xt)
fit <- lm(y ~  x1 + I(x1 ^ 2) + z - 1)

Run Code Online (Sandbox Code Playgroud)

但直线似乎与二次多项式相切xt.我哪里做错了？

类似的问题:

逐行回归,在断点处连接一条直线和一条水平线
将V形曲线拟合到我的数据上(在Cross Validated上)

Answer 1

李哲源*_*李哲源 6

这是消化线性模型的理论和实现的一个很好的练习(可能很难).我的答案将包含两部分:

第1部分(这一部分)介绍了我使用的参数化以及这种分段回归如何简化为普通的最小二乘问题.提供了包括模型估计,断点选择和预测的R函数.
第2部分(另一部分)展示了一个玩具,可重复的例子,说明如何使用我定义的那些功能.

我必须使用不同的参数化,因为你在问题中给出的那个是错误的!您的参数化只能确保函数值的连续性,但不能保证一阶导数!这就是为什么你的拟合线与拟合的二次多项式相切的原因xt.

## generate design matrix
getX <- function (x, c) {
  x <- x - c
  cbind("beta0" = 1, "beta1" = x, "beta2" = pmin(x, 0) ^ 2)
  }

Run Code Online (Sandbox Code Playgroud)

est下面的函数包含.lm.fit(为了最大效率)估计和推断模型,给定c:

## `x`, `y` give data points; `c` is known break point
est <- function (x, y, c) {
  ## model matrix
  X <- getX(x, c)
  p <- dim(X)[2L]
  ## solve least squares with QR factorization
  fit <- .lm.fit(X, y)
  ## compute Pearson estimate of `sigma ^ 2`
  r <- c(fit$residuals)
  n <- length(r)
  RSS <- c(crossprod(r))
  sig2 <- RSS / (n - p)
  ## coefficients summary table
  beta <- fit$coefficients
  R <- "dimnames<-"(fit$qr[1:p, ], NULL)
  Rinv <- backsolve(R, diag(p))
  se <- sqrt(rowSums(Rinv ^ 2) * sig2)
  tstat <- beta / se
  pval <- 2 * pt(abs(tstat), n - p, lower.tail = FALSE)
  tab <- matrix(c(beta, se, tstat, pval), nrow = p, ncol = 4L,
                dimnames = list(dimnames(X)[[2L]], 
                c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
  ## 2 * negative log-likelihood
  nega2logLik <- n * log(2 * pi * sig2) + (n - p)
  ## AIC / BIC
  aic <- nega2logLik + 2 * (p + 1)
  bic <- nega2logLik + log(n) * (p + 1)
  ## multiple R-squared and adjusted R-squared
  TSS <- c(crossprod(y - sum(y) / n))
  r.squared <- 1 - RSS / TSS
  adj.r.squared <- 1 - sig2 * (n - 1) / TSS
  ## return
  list(coefficients = beta, residuals = r, fitted.values = c(X %*% beta),
       R = R, sig2 = sig2, coef.table = tab, aic = aic, bic = bic, c = c,
       RSS = RSS, r.squared = r.squared, adj.r.squared = adj.r.squared)
  }

Run Code Online (Sandbox Code Playgroud)

正如您所看到的,它还会返回各种摘要,就像summary.lm调用了一样.现在让我们编写另一个包装器函数choose.c.它勾画RSS反对c.grid,并选择返回的最佳模式c.

choose.c <- function (x, y, c.grid) {
  if (is.unsorted(c.grid)) stop("'c.grid' in not increasing")
  ## model list
  lst <- lapply(c.grid, est, x = x, y = y)
  ## RSS trace
  RSS <- sapply(lst, "[[", "RSS")
  ## verbose
  plot(c.grid, RSS, type = "b", pch = 19)
  ## find `c` / the model minimizing `RSS`
  lst[[which.min(RSS)]]
  }

Run Code Online (Sandbox Code Playgroud)

到现在为止还挺好.为了完成这个故事,我们还想要一个predict例程.

pred <- function (model, x.new) {
  ## prediction matrix
  X <- getX(x.new, model$c)
  p <- dim(X)[2L]
  ## predicted mean
  fit <- X %*% model$coefficients
  ## prediction standard error
  Qt <- forwardsolve(t(model$R), t(X))
  se <- sqrt(colSums(Qt ^ 2) * model$sig2)
  ## 95%-confidence interval
  alpha <- qt(0.025, length(model$residuals) - p)
  lwr <- fit + alpha * se
  upr <- fit - alpha * se
  ## return
  matrix(c(fit, se, lwr, upr), ncol = 4L,
         dimnames = list(NULL, c("fit", "se", "lwr", "upr")))
  }

Run Code Online (Sandbox Code Playgroud)

Answer 2

李哲源*_*李哲源 6

在本节中,我将演示一个可重现的例子.请确保您在其他答案中定义了源函数.

## we first generate a true model
set.seed(0)
x <- runif(100)  ## sample points on [0, 1]
beta <- c(0.1, -0.2, 2)  ## true coefficients
X <- getX(x, 0.6)  ## model matrix with true break point at 0.6
y <- X %*% beta + rnorm(100, 0, 0.08)  ## observations with Gaussian noise
plot(x, y)

Run Code Online (Sandbox Code Playgroud)

现在,假设我们不知道c,我们想在均匀间隔的网格上搜索:

c.grid <- seq(0.1, 0.9, 0.05)
fit <- choose.c(x, y, c.grid)
fit$c

Run Code Online (Sandbox Code Playgroud)

RSS选择了0.55.这与真实值略有不同0.6,但从图中我们看到RSS曲线之间的变化不大,[0.5, 0.6]所以我很满意.

生成的模型fit包含丰富的信息:

#List of 12
# $ coefficients : num [1:3] 0.114 -0.246 2.366
# $ residuals    : num [1:100] 0.03279 -0.01515 0.21188 -0.06542 0.00763 ...
# $ fitted.values: num [1:100] 0.0292 0.3757 0.2329 0.1087 0.0263 ...
# $ R            : num [1:3, 1:3] -10 0.1 0.1 0.292 2.688 ...
# $ sig2         : num 0.00507
# $ coef.table   : num [1:3, 1:4] 0.1143 -0.2456 2.3661 0.0096 0.0454 ...
#  ..- attr(*, "dimnames")=List of 2
#  .. ..$ : chr [1:3] "beta0" "beta1" "beta2"
#  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
# $ aic          : num -240
# $ bic          : num -243
# $ c            : num 0.55
# $ RSS          : num 0.492
# $ r.squared    : num 0.913
# $ adj.r.squared: num 0.911

Run Code Online (Sandbox Code Playgroud)

我们可以提取系数的汇总表:

fit$coef.table
#        Estimate  Std. Error   t value     Pr(>|t|)
#beta0  0.1143132 0.009602697 11.904286 1.120059e-20
#beta1 -0.2455986 0.045409356 -5.408546 4.568506e-07
#beta2  2.3661097 0.169308226 13.975161 5.730682e-25

Run Code Online (Sandbox Code Playgroud)

最后,我们想看一些预测图.

x.new <- seq(0, 1, 0.05)
p <- pred(fit, x.new)

head(p)
#           fit     se.fit       lwr       upr
#[1,] 0.9651406 0.02903484 0.9075145 1.0227668
#[2,] 0.8286400 0.02263111 0.7837235 0.8735564
#[3,] 0.7039698 0.01739193 0.6694516 0.7384880
#[4,] 0.5911302 0.01350837 0.5643199 0.6179406
#[5,] 0.4901212 0.01117924 0.4679335 0.5123089
#[6,] 0.4009427 0.01034868 0.3804034 0.4214819

Run Code Online (Sandbox Code Playgroud)

我们可以制作一个情节:

plot(x, y, cex = 0.5)
matlines(x.new, p[,-2], col = c(1,2,2), lty = c(1,2,2), lwd = 2)

Run Code Online (Sandbox Code Playgroud)

归档时间：	9 年，2 月前
查看次数：	1334 次
最近记录：	6 年，10 月前