Simulating data for logistic regression with a fixed r2

Question

Lio*_*ens 7 r variance logistic-regression

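I want to simulate data for a logistic regression in which I can specify the explained variance in advance. Have a look at the code below. I simulate four independent variables and specify that each logit coefficient should be log(2) = 0.69. This works nicely: the explained variance (I report Cox and Snell's r2) is 0.34.

However, I need to specify the regression coefficients so that a pre-specified r2 results from the regression. So if I wanted to produce an r2 of, say, exactly 0.1, how would the coefficients need to be specified? I am struggling with this.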

# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                0.25,1,0.25,0.25,   
                0.25,0.25,1,0.25,    
                0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4) 
n.obs <- 500000 

library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Create latent continuous response variable 
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4

# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)

# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)

The output is:

Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.7536  -0.7795  -0.0755   0.7813   3.3382  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.002098   0.003544  -0.592    0.554    
V1           0.691034   0.004089 169.014   <2e-16 ***
V2           0.694052   0.004088 169.776   <2e-16 ***
V3           0.693222   0.004079 169.940   <2e-16 ***
V4           0.699091   0.004081 171.310   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 693146  on 499999  degrees of freedom
Residual deviance: 482506  on 499995  degrees of freedom
AIC: 482516

Number of Fisher Scoring iterations: 5

Cox and Snell's r2 gives:

library(pscl)
pR2(logreg)["r2ML"]

> pR2(logreg)["r2ML"]
 r2ML 
0.3436523
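
For reference, the textbook Cox and Snell definition is 1 - (L0/L1)^(2/n), where L0 and L1 are the likelihoods of the intercept-only and the fitted model. For a 0/1 response the deviance equals -2 times the log-likelihood, so the value can also be computed from the null and residual deviances that `summary()` prints. The sketch below checks this on a small simulated data set; the sample size, seed, and single predictor are arbitrary assumptions, and it does not rely on `pscl`.

```r
# Minimal check on a small simulated data set (size and seed are arbitrary)
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(log(2) * x))
fit <- glm(y ~ x, family = binomial)

# Cox and Snell: 1 - (L0 / L1)^(2/n). For a 0/1 response the deviance
# equals -2 * log-likelihood, so the two deviances are sufficient.
r2_dev <- 1 - exp((fit$deviance - fit$null.deviance) / n)

# Same quantity from the log-likelihoods of the null and the full model
fit0 <- glm(y ~ 1, family = binomial)
r2_ll <- 1 - exp(2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit))) / n)

c(deviance = r2_dev, loglik = r2_ll)
```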

Answer 1

Fre*_*lia 1

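R-squared (and its variants) is a random variable, because it depends on your simulated data. If you simulate data several times with exactly the same parameters, you will most likely get a different R-squared value each time. Therefore you cannot produce a simulation whose R-squared is exactly 0.1 just by controlling the parameters.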
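
A quick way to see this variability is to repeat the simulation from the question several times with identical parameters and compare the resulting values. This is only a sketch: `sim_r2` is a hypothetical helper, and the sample size of 2000 and the 20 replications are arbitrary choices made to keep it fast.

```r
library(MASS)

# Hypothetical helper: simulate one data set as in the question and
# return the Cox and Snell r2 of the fitted logistic regression.
sim_r2 <- function(n, beta = log(2), rho = 0.25, p = 4) {
  sigma <- matrix(rho, p, p)
  diag(sigma) <- 1
  X <- mvrnorm(n, mu = rep(0, p), Sigma = sigma)
  y <- rbinom(n, size = 1, prob = plogis(drop(X %*% rep(beta, p))))
  fit <- glm(y ~ X, family = binomial)
  1 - exp((fit$deviance - fit$null.deviance) / n)
}

set.seed(42)
r2_draws <- replicate(20, sim_r2(n = 2000))
range(r2_draws)  # same parameters, different r2 in every replication
sd(r2_draws)
```

The spread shrinks as the sample size grows, but it does not vanish for any finite sample.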

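On the other hand, since it is a random variable, you could in principle simulate your data from the conditional distribution (conditioning on a fixed value of R-squared), but you would need to work out what those distributions look like. The math might get really ugly here, and Cross Validated is a better fit for that part.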
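
If an approximate target is acceptable, a pragmatic alternative (not part of this answer, and not the conditional simulation described above) is to search numerically for a common coefficient whose r2 on one large simulated data set is close to the target. The sketch below assumes all four coefficients are equal, reuses one fixed set of uniform draws so that the search function does not jump around randomly, and uses `uniroot` with an arbitrary search interval of 0.01 to 1.

```r
library(MASS)

set.seed(123)
n <- 20000; p <- 4; rho <- 0.25
sigma <- matrix(rho, p, p)
diag(sigma) <- 1
X <- mvrnorm(n, mu = rep(0, p), Sigma = sigma)
u <- runif(n)  # fixed uniforms (common random numbers) for every beta tried

# Cox and Snell r2 when all four coefficients are equal to beta
r2_for_beta <- function(beta) {
  y <- as.integer(u < plogis(drop(X %*% rep(beta, p))))  # P(y = 1) = plogis(...)
  fit <- glm(y ~ X, family = binomial)
  1 - exp((fit$deviance - fit$null.deviance) / n)
}

target <- 0.1
sol <- uniroot(function(b) r2_for_beta(b) - target, interval = c(0.01, 1))
sol$root               # calibrated common coefficient
r2_for_beta(sol$root)  # close to the target on this particular data set
```

A fresh data set simulated with the calibrated coefficient will still give an r2 that scatters around 0.1 rather than hitting it exactly, which is the point made above.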
