Simulating multiple-regression data with a fixed R²: how do I incorporate correlated predictors?

Lio*_*ens 4 simulation r linear-regression

I want to simulate data for a multiple linear regression with four predictor variables, where I can freely specify

  • the overall explained variance of the model,
  • the size of all standardized regression coefficients, and
  • the degree to which the predictors correlate with each other.

I have arrived at a solution that satisfies the first two points, but it rests on the assumption that all predictors are uncorrelated (see the code below). To obtain standardized regression coefficients, I sample from population variables with mean 0 and variance 1.

```r
# Specify the population variance/covariance matrix of the four predictor variables
sigma.1 <- matrix(c(1,0,0,0,
                    0,1,0,0,
                    0,0,1,0,
                    0,0,0,1), nrow=4, ncol=4)
# Specify the population means of the four predictor variables
mu.1 <- rep(0,4)

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # large, to avoid sampling-error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# Create a sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical = FALSE))

# Add an error variable based on the desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))

# Add y based on the true coefficients and the error term
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect the model
summary(lm(y ~ V1 + V2 + V3 + V4, data = sample1))

Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299, Adjusted R-squared:  0.299 
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16
```
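The error-variance formula in the code follows from the identity R² = Var(Xβ) / (Var(Xβ) + σ²ε), where Var(Xβ) = Σβᵢ² when the predictors are uncorrelated with unit variance. A quick sanity check of that arithmetic (sketched in Python rather than R, purely for illustration):

```python
# Check the error-variance formula for uncorrelated, unit-variance predictors:
# R^2 = Var(X*beta) / (Var(X*beta) + var_epsilon), with Var(X*beta) = sum(beta_i^2).
beta = [0.4, 0.3, 0.25, 0.25]
r2_target = 0.30

var_xb = sum(b * b for b in beta)                    # 0.375
var_epsilon = var_xb * (1 - r2_target) / r2_target   # 0.875, matching the R code

r2_implied = var_xb / (var_xb + var_epsilon)
print(var_epsilon, r2_implied)  # ~0.875 and ~0.30
```

This is why the fitted model above recovers Multiple R-squared ≈ 0.30.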

The problem: if my predictors are supposed to be correlated, i.e. if I specify their variance/covariance matrix with nonzero off-diagonal elements, then the resulting r2 and regression coefficients deviate substantially from what I intended, e.g. when using

```r
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
                    0.25,1,0.25,0.25,
                    0.25,0.25,1,0.25,
                    0.25,0.25,0.25,1), nrow=4, ncol=4)
```
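The size of the mismatch can be worked out from the population quantities: with a common correlation ρ between unit-variance predictors, Var(Xβ) = Σβᵢ² + 2ρ·Σᵢ<ⱼβᵢβⱼ, but the code still sets the error variance as if the cross terms were zero. A sketch of that arithmetic (Python, using the values from the question):

```python
from itertools import combinations

beta = [0.4, 0.3, 0.25, 0.25]
rho = 0.25        # common correlation between the predictors
r2_target = 0.30

# Error variance as computed by the original code (ignores the cross terms)
var_epsilon = sum(b * b for b in beta) * (1 - r2_target) / r2_target

# True variance of the linear predictor under the correlated sigma.1
var_xb = (sum(b * b for b in beta)
          + 2 * rho * sum(a * b for a, b in combinations(beta, 2)))

r2_actual = var_xb / (var_xb + var_epsilon)
print(r2_actual)  # roughly 0.42 instead of the intended 0.30
```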

Any suggestions? Thanks!

Lio*_*ens 6

After thinking about my problem some more, I found the answer.

The code above first samples predictor variables with a given level of correlation between them. An error column is then added based on the desired r2 value, and finally everything is summed to produce the y column.

The lines responsible for the problem were simply the ones that create the error term:
```r
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))
```

This assumes that each beta coefficient contributes 100% of its explained variance to y independently (= the predictors are uncorrelated). But if the x variables are correlated, each beta does not (!) contribute 100% on its own: the correlated predictors share variance, so the linear predictor varies more than Σβᵢ² alone, and the variance of the error term has to be larger to reach the same R².

How much larger? Simply adjust how the error term is created, like this:
```r
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
```

So the degree of intercorrelation of the predictors is taken into account simply by adding cor(sample1$V1, sample1$V2) to the error variance. With an intercorrelation of 0.25, e.g. using
```r
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
                    0.25,1,0.25,0.25,
                    0.25,0.25,1,0.25,
                    0.25,0.25,0.25,1), nrow=4, ncol=4)
```

cor(sample1$V1, sample1$V2) comes out at roughly 0.25, and that value is added to the variance of the error term.

Assuming all intercorrelations are equal, as here, this lets you specify any degree of intercorrelation among the predictors together with the true standardized regression coefficients and the desired R2.
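A caveat worth noting: adding `cor(sample1$V1, sample1$V2)` once is only an approximation. The exact variance contributed by the predictors is the quadratic form βᵀΣβ = Σβᵢ² + 2ρ·Σᵢ<ⱼβᵢβⱼ, which is why the proof below lands at R² ≈ 0.156 rather than exactly 0.15. A sketch comparing the two (Python, with the quadratic form expanded by hand for the equicorrelated case):

```python
from itertools import combinations

beta = [0.4, 0.3, 0.25, 0.25]
rho = 0.35         # common correlation, as in the proof below
r2_target = 0.15

sum_sq = sum(b * b for b in beta)
cross = sum(a * b for a, b in combinations(beta, 2))

# Exact variance of X*beta under an equicorrelated, unit-variance sigma: beta' Sigma beta
var_xb_exact = sum_sq + 2 * rho * cross
var_epsilon_exact = var_xb_exact * (1 - r2_target) / r2_target

# The answer's adjustment adds rho once instead of 2 * rho * cross
var_epsilon_approx = (sum_sq + rho) * (1 - r2_target) / r2_target

r2_exact = var_xb_exact / (var_xb_exact + var_epsilon_exact)    # exactly 0.15
r2_approx = var_xb_exact / (var_xb_exact + var_epsilon_approx)  # about 0.154
print(r2_exact, r2_approx)
```

In R, the exact population version would be `var.epsilon <- t(beta) %*% sigma.1 %*% beta * (1 - r2)/r2`; with these betas the two happen to be close, which is why the approximation works well here.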

Proof:
```r
sigma.1 <- matrix(c(1,0.35,0.35,0.35,
                    0.35,1,0.35,0.35,
                    0.35,0.35,1,0.35,
                    0.35,0.35,0.35,1), nrow=4, ncol=4)
# Specify the population means of the four predictor variables
mu.1 <- rep(0,4)

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # large, to avoid sampling-error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15

# Create a sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical = FALSE))

# Add an error variable based on the desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))

# Add y based on the true coefficients and the error term
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect the model
summary(lm(y ~ V1 + V2 + V3 + V4, data = sample1))

Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7250  -1.3696   0.0017   1.3650   9.0460 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.499554   0.002869  174.14   <2e-16 ***
V1          0.406360   0.003236  125.56   <2e-16 ***
V2          0.298892   0.003233   92.45   <2e-16 ***
V3          0.247581   0.003240   76.42   <2e-16 ***
V4          0.253510   0.003241   78.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.028 on 499995 degrees of freedom
Multiple R-squared:  0.1558,    Adjusted R-squared:  0.1557 
F-statistic: 2.306e+04 on 4 and 499995 DF,  p-value: < 2.2e-16
```