面板数据的双集群标准错误

Ale*_*lex 8 regression r standard-error panel-data plm

我在R(时间和横截面)中有一个面板数据集,并且想要计算由两个维度聚类的标准误差,因为我的残差是双向相关的.谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/提供了执行此操作的功能.它似乎有点特别,所以我想知道是否有一个已经过测试的包并且这样做了吗?

我知道sandwichHAC标准错误,但它没有做双聚类(即沿着两个维度).

Gra*_*ant 10

这是一个老问题。但是看到人们似乎仍然在使用它,我想我会提供一些现代方法来在 R 中进行多路聚类:

选项 1(最快): fixest::feols()

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols

## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode)  ## add third cluster var (not in original model call)
## etc.
Run Code Online (Sandbox Code Playgroud)

选项 2(快速): lfe::felm()

library(lfe)

## Note the third "| 0 " slot means we're not using IV

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)
Run Code Online (Sandbox Code Playgroud)

选项 3(较慢,但灵活): sandwich

library(sandwich)
library(lmtest)

est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Run Code Online (Sandbox Code Playgroud)

基准

Aaaand,只是为了强调速度。这是三种不同方法的基准(使用两个固定 FE 和双向聚类)。

est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork) 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  11.94122  11.96158  12.55835  11.98193  12.86692  13.75191     3 a  
#>       est_felm()  87.18064  95.89905 100.69589 104.61746 107.45352 110.28957     3  b 
#>  est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886     3   c
Run Code Online (Sandbox Code Playgroud)


lan*_*oni 7

对于面板回归,plm包可以沿两个维度估计聚类SE.

使用M. Petersen的基准测试结果:

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
Run Code Online (Sandbox Code Playgroud)

所以现在你可以获得集群的SE:

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Run Code Online (Sandbox Code Playgroud)

有关更多详情,请参阅


但是,只有当您的数据被强制转换为a时,上述方法才有效pdata.frame.如果你有失败,它将会失败"duplicate couples (time-id)".在这种情况下,您仍然可以聚类,但只能沿着一个维度聚类.

诱骗plm到,你有一个正确的面板数据,通过指定唯一的定势思维一个指标:

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
Run Code Online (Sandbox Code Playgroud)

所以现在你可以获得集群的SE:

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run Code Online (Sandbox Code Playgroud)

您还可以使用此变通方法按更高维度更高级别(例如industrycountry)进行集群.但是在这种情况下,您将无法使用group(或time)effects,这是该方法的主要限制.


另一种适用于面板和其他类型数据的方法是multiwayvcov包.它允许双重聚类,但也可以在更高的维度进行聚类.根据软件包的网站,它是对Arai代码的改进:

  • 由于缺失,观察的透明处理下降
  • 完全多路(或n路,或n维或多维)聚类

使用Petersen数据和cluster.vcov():

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run Code Online (Sandbox Code Playgroud)


use*_*072 5

Arai的功能可用于聚类标准错误.他有另一个版本用于多维聚类:

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          cluster12 = paste(cluster1,cluster2, sep="")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}
Run Code Online (Sandbox Code Playgroud)

有关参考和用法示例,请参阅:


Xu *_*ang 4

Frank Harrell 的包rms(以前被命名为Design)有一个我在聚类时经常使用的函数:robcov

\n\n

?robcov例如,请参阅 的这一部分。

\n\n
cluster: a variable indicating groupings. \xe2\x80\x98cluster\xe2\x80\x99 may be any type of\n      vector (factor, character, integer).  NAs are not allowed.\n      Unique values of \xe2\x80\x98cluster\xe2\x80\x99 indicate possibly correlated\n      groupings of observations. Note the data used in the fit and\n      stored in \xe2\x80\x98fit$x\xe2\x80\x99 and \xe2\x80\x98fit$y\xe2\x80\x99 may have had observations\n      containing missing values deleted. It is assumed that if any\n      NAs were removed during the original model fitting, an\n      \xe2\x80\x98naresid\xe2\x80\x99 function exists to restore NAs so that the rows of\n      the score matrix coincide with \xe2\x80\x98cluster\xe2\x80\x99. If \xe2\x80\x98cluster\xe2\x80\x99 is\n      omitted, it defaults to the integers 1,2,...,n to obtain the\n      "sandwich" robust covariance matrix estimate.\n
Run Code Online (Sandbox Code Playgroud)\n