R: lm() result differs when using the "weights" argument and when using manually reweighted data

Question

Mag*_*ean 2 regression r linear-regression lm

To correct for heteroskedasticity in the error term, I am running the following weighted least squares regression in R:

#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)

#Weighted Residuals:
#     Min       1Q   Median       3Q      Max 
#-1.83779 -0.33226  0.02011  0.25135  1.48516 

#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -3.939440   0.609991  -6.458 1.62e-09 ***
#q            0.175019   0.070101   2.497 0.013696 *  
#q2           0.048790   0.005613   8.693 8.49e-15 ***
#b            0.473891   0.134918   3.512 0.000598 ***
#c            0.119551   0.125430   0.953 0.342167    
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared:  0.9639,   Adjusted R-squared:  0.9628 
#F-statistic: 933.6 on 4 and 140 DF,  p-value: < 2.2e-16

where "weighting" is a variable (a function of the variable q) used to weight the observations, and q2 is simply q^2.

Now, to double-check the results, I manually weight my variables by creating new weighted variables:

mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting

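and run the following regression, without the weights option and without a constant. Since the constant is weighted too, the column of 1s in the original predictor matrix should now equal the variable weighting: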

#Call:
#lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls, 
#data = mydata)

#Residuals:
#     Min       1Q   Median       3Q      Max 
#-2.38404 -0.55784  0.01922  0.49838  2.62911 

#Coefficients:
#         Estimate Std. Error t value Pr(>|t|)    
#weighting -4.125559   0.579093  -7.124 5.05e-11 ***
#q.wls    0.217722   0.081851   2.660 0.008726 ** 
#q2.wls   0.045664   0.006229   7.330 1.67e-11 ***
#b.wls    0.466207   0.121429   3.839 0.000186 ***
#c.wls    0.133522   0.112641   1.185 0.237876    
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared:  0.9823,   Adjusted R-squared:  0.9817 
#F-statistic:  1556 on 5 and 140 DF,  p-value: < 2.2e-16

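As you can see, the results are similar but not identical. Am I doing something wrong when manually weighting the variables, or does the "weights" option do something more than simply multiplying the variables by the weighting vector?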

Answer 1

李哲源*_*李哲源 7

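As long as you do the manual weighting correctly, you will not see any discrepancy. The variables must be multiplied by the square root of the weights, not by the weights themselves.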
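A short derivation may help show where the square root comes from. The notation here (W for the diagonal matrix of weights, tildes for the transformed quantities) is an assumption chosen to match the variable names X_tilde and y_tilde used in the code; it relies only on the fact that lm() with weights minimizes the weighted sum of squared residuals.

```latex
\begin{aligned}
\hat\beta_{\mathrm{WLS}}
  &= \arg\min_{\beta} \sum_{i=1}^{n} w_i \,\bigl(y_i - x_i^{\top}\beta\bigr)^2 \\
  &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(\sqrt{w_i}\,y_i - \sqrt{w_i}\,x_i^{\top}\beta\bigr)^2 \\
  &= \arg\min_{\beta} \,\bigl\lVert \tilde y - \tilde X \beta \bigr\rVert^2,
  \qquad \tilde y = W^{1/2} y,\quad \tilde X = W^{1/2} X,\quad W = \operatorname{diag}(w_1,\dots,w_n).
\end{aligned}
```

The last line is an ordinary least squares problem in the transformed data, so every column of X, including the intercept column of 1s, is scaled by the root weights.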

So the correct way to do it is:

X <- model.matrix(~ q + q2 + b + c, mydata)  ## non-weighted model matrix (with intercept)
w <- mydata$weighting  ## weights
rw <- sqrt(w)    ## root weights
y <- mydata$a    ## non-weighted response
X_tilde <- rw * X    ## weighted model matrix (with intercept)
y_tilde <- rw * y    ## weighted response

## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)

Although it is generally recommended to use lm.fit and lm.wfit when passing matrices directly:

matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)

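However, when using these internal subroutines lm.fit and lm.wfit, all input must be complete cases without NA; otherwise the underlying C routine stats:::C_Cdqrls will complain.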
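One way to satisfy the complete-case requirement is to filter the rows before building the matrices. The following is a minimal sketch, not part of the original answer: the data frame d (the built-in trees data with one value deliberately set to NA) and the names d_cc and w_cc are hypothetical, chosen only for illustration.

```r
# Hypothetical data: built-in `trees` with one missing value injected
d <- trees
d$Girth[3] <- NA
set.seed(0)
d$w <- runif(nrow(d), 1, 2)   # illustrative weights between 1 and 2

# Keep only rows that are complete in every variable used by the fit
keep <- complete.cases(d[, c("Height", "Girth", "Volume", "w")])
d_cc <- d[keep, ]

X    <- model.matrix(~ Girth + Volume, d_cc)  # model matrix on complete cases
y    <- d_cc$Height
w_cc <- d_cc$w

matfit <- lm.wfit(X, y, w_cc)

# lm() drops incomplete rows itself under the default na.action (na.omit),
# so its coefficients should agree with the matrix-based fit
fit <- lm(Height ~ Girth + Volume, data = d, weights = w)
all.equal(unname(matfit$coefficients), unname(coef(fit)))
```

Filtering the data frame first, rather than the matrices afterwards, keeps X, y, and the weights aligned row by row.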

If you still want to use the formula interface rather than matrices, you can do the following:

## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting

fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)

fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
                 data = mydata)
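The question's manual version multiplied every variable by the weights themselves. OLS on data scaled by w minimizes the sum of w^2 times the squared residuals, so it corresponds to a weighted fit with weights w^2 rather than w. Since mydata is not available, the sketch below illustrates this on the built-in trees data; the column names H.w, G.w, and V.w are hypothetical.

```r
set.seed(0)
d <- trees
d$w   <- runif(nrow(d), 1, 2)   # illustrative weights
d$H.w <- d$Height * d$w         # variables multiplied by w (not sqrt(w))
d$G.w <- d$Girth  * d$w
d$V.w <- d$Volume * d$w

# Manual weighting as done in the question: scale by w, no intercept,
# with the column `w` standing in for the scaled constant
fit_manual_w <- lm(H.w ~ 0 + w + G.w + V.w, data = d)

# Weighted fits with squared weights and with the intended weights
fit_wls_w2 <- lm(Height ~ Girth + Volume, data = d, weights = w^2)
fit_wls_w  <- lm(Height ~ Girth + Volume, data = d, weights = w)

# The manual fit matches the w^2 fit ...
all.equal(unname(coef(fit_manual_w)), unname(coef(fit_wls_w2)))

# ... and in general does not match the fit that uses weights = w
rbind(manual = unname(coef(fit_manual_w)), wls = unname(coef(fit_wls_w)))
```

This would explain why the two sets of estimates in the question are close but not identical.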

Reproducible example

Let's use R's built-in dataset trees. Use head(trees) to inspect it. This dataset contains no NA. We aim to fit the model:

Height ~ Girth + Volume

with some random weights between 1 and 2:

set.seed(0); w <- runif(nrow(trees), 1, 2)

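We fit this model by weighted regression, either by passing the weights to lm, or by manually transforming the data and calling lm without weights: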

X <- model.matrix(~ Girth + Volume, trees)  ## non-weighted model matrix (with intercept)
rw <- sqrt(w)    ## root weights
y <- trees$Height    ## non-weighted response
X_tilde <- rw * X    ## weighted model matrix (with intercept)
y_tilde <- rw * y    ## weighted response

fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)

#Coefficients:
#X(Intercept)        XGirth       XVolume  
#     83.2127       -1.8639        0.5843

fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)

#Coefficients:
#X_tilde(Intercept)        X_tildeGirth       X_tildeVolume  
#           83.2127             -1.8639              0.5843

So indeed, we see identical results.

Alternatively, we can use lm.fit and lm.wfit:

matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)

We can check the coefficients by:

matfit_by_wls$coefficients
#(Intercept)       Girth      Volume 
# 83.2127455  -1.8639351   0.5843191 

matfit_by_ols$coefficients
#(Intercept)       Girth      Volume 
# 83.2127455  -1.8639351   0.5843191

Again, the results are the same.
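The agreement should extend beyond the coefficients. As a further check that is not part of the original answer, the sketch below rebuilds the example's objects and compares the standard errors, the residual standard error, and the residuals; the names se_wls and se_ols are introduced here only for illustration.

```r
set.seed(0)
w  <- runif(nrow(trees), 1, 2)
rw <- sqrt(w)                               # root weights
X  <- model.matrix(~ Girth + Volume, trees) # non-weighted model matrix
y  <- trees$Height
X_tilde <- rw * X                           # each row scaled by its root weight
y_tilde <- rw * y

fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)

# Standard errors of the coefficient estimates
se_wls <- unname(coef(summary(fit_by_wls))[, "Std. Error"])
se_ols <- unname(coef(summary(fit_by_ols))[, "Std. Error"])
all.equal(se_wls, se_ols)

# Residual standard error
all.equal(summary(fit_by_wls)$sigma, summary(fit_by_ols)$sigma)

# The OLS residuals on transformed data are the weighted residuals of the WLS fit
all.equal(unname(weighted.residuals(fit_by_wls)), unname(residuals(fit_by_ols)))
```

Note that residuals(fit_by_wls) returns the unweighted residuals, which is why weighted.residuals() is the right quantity to compare against the transformed fit.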
