Ale*_*lex 8 regression r standard-error panel-data plm
我在R(时间和横截面)中有一个面板数据集,并且想要计算由两个维度聚类的标准误差,因为我的残差是双向相关的.谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/提供了执行此操作的功能.它似乎有点特别,所以我想知道是否有一个已经过测试的包并且这样做了吗?
我知道sandwichHAC标准错误,但它没有做双聚类(即沿着两个维度).
Gra*_*ant 10
这是一个老问题。但是看到人们似乎仍然在使用它,我想我会提供一些现代方法来在 R 中进行多路聚类:
fixest::feols()library(fixest)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols
## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode) ## add third cluster var (not in original model call)
## etc.
Run Code Online (Sandbox Code Playgroud)
lfe::felm()library(lfe)
## Note the third "| 0 " slot means we're not using IV
est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)
Run Code Online (Sandbox Code Playgroud)
sandwichlibrary(sandwich)
library(lmtest)
est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork)
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Run Code Online (Sandbox Code Playgroud)
Aaaand,只是为了强调速度。这是三种不同方法的基准(使用两个固定 FE 和双向聚类)。
est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork),
vcov = vcovCL, cluster = ~ race + year)}
microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> est_feols() 11.94122 11.96158 12.55835 11.98193 12.86692 13.75191 3 a
#> est_felm() 87.18064 95.89905 100.69589 104.61746 107.45352 110.28957 3 b
#> est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886 3 c
Run Code Online (Sandbox Code Playgroud)
对于面板回归,plm包可以沿两个维度估计聚类SE.
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
vcovHC(x, method="white1", ...)
}
fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
Run Code Online (Sandbox Code Playgroud)
所以现在你可以获得集群的SE:
##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.022189 1.3376 0.1811
x 1.034833 0.031679 32.6666 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.064580 0.4596 0.6458
x 1.034833 0.052465 19.7243 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run Code Online (Sandbox Code Playgroud)
有关更多详情,请参阅
但是,只有当您的数据被强制转换为a时,上述方法才有效pdata.frame.如果你有失败,它将会失败"duplicate couples (time-id)".在这种情况下,您仍然可以聚类,但只能沿着一个维度聚类.
诱骗plm到,你有一个正确的面板数据,通过指定唯一的定势思维一个指标:
fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
Run Code Online (Sandbox Code Playgroud)
所以现在你可以获得集群的SE:
##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run Code Online (Sandbox Code Playgroud)
您还可以使用此变通方法按更高维度或更高级别(例如industry或country)进行集群.但是在这种情况下,您将无法使用group(或time)effects,这是该方法的主要限制.
另一种适用于面板和其他类型数据的方法是multiwayvcov包.它允许双重聚类,但也可以在更高的维度进行聚类.根据软件包的网站,它是对Arai代码的改进:
- 由于缺失,观察的透明处理下降
- 完全多路(或n路,或n维或多维)聚类
使用Petersen数据和cluster.vcov():
library("lmtest")
library("multiwayvcov")
data(petersen)
m1 <- lm(y ~ x, data = petersen)
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029680 0.065066 0.4561 0.6483
## x 1.034833 0.053561 19.3206 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run Code Online (Sandbox Code Playgroud)
Arai的功能可用于聚类标准错误.他有另一个版本用于多维聚类:
mcl <- function(dat,fm, cluster1, cluster2){
attach(dat, warn.conflicts = F)
library(sandwich);library(lmtest)
cluster12 = paste(cluster1,cluster2, sep="")
M1 <- length(unique(cluster1))
M2 <- length(unique(cluster2))
M12 <- length(unique(cluster12))
N <- length(cluster1)
K <- fm$rank
dfc1 <- (M1/(M1-1))*((N-1)/(N-K))
dfc2 <- (M2/(M2-1))*((N-1)/(N-K))
dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
u1j <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum))
u2j <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum))
u12j <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
vc1 <- dfc1*sandwich(fm, meat=crossprod(u1j)/N )
vc2 <- dfc2*sandwich(fm, meat=crossprod(u2j)/N )
vc12 <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
vcovMCL <- vc1 + vc2 - vc12
coeftest(fm, vcovMCL)}
Run Code Online (Sandbox Code Playgroud)
有关参考和用法示例,请参阅:
Frank Harrell 的包rms(以前被命名为Design)有一个我在聚类时经常使用的函数:robcov。
?robcov例如,请参阅 的这一部分。
cluster: a variable indicating groupings. \xe2\x80\x98cluster\xe2\x80\x99 may be any type of\n vector (factor, character, integer). NAs are not allowed.\n Unique values of \xe2\x80\x98cluster\xe2\x80\x99 indicate possibly correlated\n groupings of observations. Note the data used in the fit and\n stored in \xe2\x80\x98fit$x\xe2\x80\x99 and \xe2\x80\x98fit$y\xe2\x80\x99 may have had observations\n containing missing values deleted. It is assumed that if any\n NAs were removed during the original model fitting, an\n \xe2\x80\x98naresid\xe2\x80\x99 function exists to restore NAs so that the rows of\n the score matrix coincide with \xe2\x80\x98cluster\xe2\x80\x99. If \xe2\x80\x98cluster\xe2\x80\x99 is\n omitted, it defaults to the integers 1,2,...,n to obtain the\n "sandwich" robust covariance matrix estimate.\nRun Code Online (Sandbox Code Playgroud)\n