---
output:
  pdf_document:
    citation_package: natbib
    latex_engine: pdflatex
    toc: true
    toc_depth: 2
    includes:
      in_header: vignette_head.tex
    keep_tex: true
title: "Robust Standard Errors in Small Samples"
author: "Michal Kolesár"
date: "`r format(Sys.time(), '%B %d, %Y')`"
geometry: margin=1in
fontfamily: mathpazo
bibliography: library.bib
fontsize: 11pt
vignette: >
  %\VignetteIndexEntry{dfadjust}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE, cache=FALSE}
library("knitr")
knitr::opts_knit$set(self.contained = FALSE)
knitr::opts_chunk$set(tidy = TRUE, collapse = TRUE, comment = "#>",
                      tidy.opts = list(blank = FALSE, width.cutoff = 55))
oldoptions <- options(digits = 3)
```

# Description[^1]

This package implements small-sample degrees of freedom adjustments to robust
and cluster-robust standard errors in linear regression, as discussed in
@ImKo16. The implementation can handle models with fixed effects, as well as
cases with a large number of observations or clusters.

[^1]: We thank Bruce Hansen for comments and Ulrich Müller for suggesting to
    us a version of Lemma 2 below.

```{r setup}
library(dfadjust)
```

To give some examples, let us construct an artificial dataset with 11 clusters:

```{r}
set.seed(7)
d1 <- data.frame(y = rnorm(1000),
                 x1 = c(rep(1, 3), rep(0, 997)),
                 x2 = c(rep(1, 150), rep(0, 850)),
                 x3 = rnorm(1000),
                 cl = as.factor(c(rep(1:10, each = 50), rep(11, 500))))
```

Let us first run a regression of `y` on `x1`. This is a case in which, in
spite of the moderate sample size, the effective number of observations is
small, since there are only three treated units:

```{r}
r1 <- lm(y ~ x1, data = d1)
## No clustering
dfadjustSE(r1)
```

We can see that the usual robust standard errors (`HC1 se`) are much smaller
than the effective standard errors (`Adj. se`), which are computed by taking
the `HC2` standard errors and applying a degrees of freedom adjustment.

Now consider a cluster-robust regression of `y` on `x2`. There are only three
treated clusters, so the effective number of observations is again small:

```{r}
r1 <- lm(y ~ x2, data = d1)
# Default Imbens-Kolesár method
dfadjustSE(r1, clustervar = d1$cl)
# Bell-McCaffrey method
dfadjustSE(r1, clustervar = d1$cl, IK = FALSE)
```

Now, let us run a regression of `y` on `x3`, with cluster fixed effects. Since
we're only interested in `x3`, we specify that we only want inference on the
second element (the first one being the intercept):

```{r}
r1 <- lm(y ~ x3 + cl, data = d1)
dfadjustSE(r1, clustervar = d1$cl, ell = 2)
dfadjustSE(r1, clustervar = d1$cl, ell = 2, IK = FALSE)
```

Finally, here is an example in which the clusters are large, with 500,000
observations:

```{r}
d2 <- do.call("rbind", replicate(500, d1, simplify = FALSE))
d2$y <- rnorm(length(d2$y))
r2 <- lm(y ~ x2, data = d2)
summary(r2)
# Default Imbens-Kolesár method
dfadjustSE(r2, clustervar = d2$cl)
# Bell-McCaffrey method
dfadjustSE(r2, clustervar = d2$cl, IK = FALSE)
```

# Methods

This section describes the implementation of the @ImKo16 and @BeMc02 degrees
of freedom adjustments. There are $S$ clusters, and we observe $n_{s}$
observations in cluster $s$, for a total of $n=\sum_{s=1}^{S}n_{s}$
observations. We handle the case with independent observations by letting each
observation be in its own cluster, with $S=n$.

Consider the linear regression of a scalar outcome $Y_{i}$ onto a $p$-vector
of regressors $X_{i}$,
\begin{equation*}
Y_{i}=X_{i}'\beta+u_{i},\qquad E[u_{i}\mid X_{i}]=0.
\end{equation*}
We're interested in inference on $\ell'\beta$ for some fixed vector
$\ell\in\mathbb{R}^{p}$.
Let $X$, $u$, and $Y$ denote the design matrix and the error and outcome
vectors, respectively. For any $n\times k$ matrix $M$, let ${M}_{s}$ denote the
$n_{s}\times k$ block corresponding to cluster $s$, so that, for instance,
$Y_{s}$ corresponds to the outcome vector in cluster $s$. For a positive
semi-definite matrix ${M}$, let ${M}^{1/2}$ be a matrix satisfying
${{M}^{1/2}}'{M}^{1/2}={M}$, such as its symmetric square root or its Cholesky
factor. Assume that
\begin{equation*}
E[u_{s}u_{s}'\mid X]=\Omega_{s},\quad\text{and}\quad
E[u_{s}u_{t}'\mid X]=0\quad\text{if $s\neq t$}.
\end{equation*}
Denote the conditional variance matrix of $u$ by $\Omega$, so that
$\Omega_{s}$ is the block of $\Omega$ corresponding to cluster $s$.

We estimate $\ell'\beta$ using OLS. In `R`, the OLS estimator is computed via
a QR decomposition, $X=QR$, where $Q' Q=I$ and $R$ is upper triangular, so we
can write the estimator as
\begin{equation*}
\ell'\hat{\beta}=\ell'\left(\sum_{s}X_{s}'X_{s}\right)^{-1}\sum_{s}X_{s}'Y_{s}
=\tilde{\ell}'\sum_{s}Q_{s}'Y_{s},\qquad \tilde{\ell}={R^{-1}}'\ell.
\end{equation*}
It has variance
\begin{equation*}
V:=\var(\ell'\hat{\beta}\mid X)
=\ell'\left(X' X\right)^{-1}\sum_{s}X_{s}'\Omega_{s}X_{s}\left(X' X\right)^{-1}\ell
=\tilde{\ell}'\sum_{s}Q_{s}'\Omega_{s}Q_{s}\tilde{\ell}.
\end{equation*}

## Variance estimate

We estimate $V$ using a variance estimator that generalizes the HC2 variance
estimator to clustering. Relative to the LZ2 estimator described in @ImKo16,
we use a slight modification that allows for fixed effects:
\begin{equation*}
\hat{V}=\ell'(X' X)^{-1}\sum_{s}X'_{s}{A}_{s}\hat{u}_{s}\hat{u}_{s}'{A}_{s}'X_{s}(X' X)^{-1}\ell
=\ell' R^{-1}\sum_{s}Q'_{s}{A}_{s}\hat{u}_{s}\hat{u}_{s}'{A}_{s}'Q_{s}{R'}^{-1}\ell
=\sum_{s=1}^{S}(\hat{u}_{s}'a_{s})^{2},
\end{equation*}
where
\begin{equation*}
\hat{u}_{s}:=Y_{s}-X_{s}\hat{\beta}=u_{s}-Q_{s}Q' u,\qquad
a_{s}={A}_{s}'Q_{s}\tilde{\ell},
\end{equation*}
and $A_{s}$ is a generalized inverse of the symmetric square root of
$I-Q_{s}Q_{s}'$, where $Q_{s}Q_{s}'$ is the block of the hat matrix
corresponding to cluster $s$. In the presence of cluster-specific fixed
effects, $I-Q_{s}Q_{s}'$ is generally not invertible, which necessitates
taking a generalized inverse. So long as the vector $\ell$ doesn't load on
these fixed effects, $\hat{V}$ will be unbiased under homoskedasticity, as the
next result, which slightly generalizes Theorem 1 in @PuTi18, shows.

\begin{lemma}
Suppose that $X=(W,L)$ is full rank, and suppose that the vector $\ell$ loads
only on elements of $W$. Let $\ddot{W}$ denote the residual from projecting
$W$ onto $L$, and suppose that for each cluster $s$, (i) $L_{s}'\ddot{W}_{s}=0$
and (ii) $\sum_{k=1}^{S}I(k\neq s)\ddot{W}_{k}'\ddot{W}_{k}$ is full rank. Then
$\hat{V}$ is unbiased under homoskedasticity.
\end{lemma}

The proof is given in the last section. By definition of projection, $L$ and
$\ddot{W}$ are orthogonal. Condition (i) of the lemma strengthens this
requirement to orthogonality within each cluster. It holds if $L$ corresponds
to a vector of cluster fixed effects, or more generally if $L$ contains
cluster-specific variables. Condition (ii) ensures that after partialling out
$L$, it is feasible to run leave-one-cluster-out regressions. Without
clustering, the condition is equivalent to the requirement that the partial
leverages associated with $\ddot{W}$ are smaller than one.[^2]

[^2]: To see this, let $H=\ddot{W}(\ddot{W}'\ddot{W})^{-1}\ddot{W}'$ denote the
    partial projection matrix. Since $H=H^{2}$,
    \begin{equation*}
    H_{ii}-H_{ii}^{2}=\sum_{j\neq i}H_{ij}H_{ji}
    =\ddot{W}_{i}'(\ddot{W}'\ddot{W})^{-1}
    \left[\sum_{j\neq i}\ddot{W}_{j}\ddot{W}_{j}'\right]
    (\ddot{W}'\ddot{W})^{-1}\ddot{W}_{i},
    \end{equation*}
    so that $H_{ii}=1$ if and only if $\sum_{j\neq i}\ddot{W}_{j}\ddot{W}_{j}'$
    has reduced rank.

If the observations are independent, the vector of leverages
$(Q_{1}'Q_{1}, \dotsc, Q_{n}'Q_{n})$ can be computed directly using the
`stats::hatvalues` function. In this case, we use this function to compute
$A_{i}=1/\sqrt{1-Q_{i}'Q_{i}}$ directly, and we then compute
$a_{i}=A_{i}Q_{i}'\tilde{\ell}$ using vector operations.
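For concreteness, here is a minimal sketch of this computation in base `R` for
the first example above (it is not the package's internal implementation, and
the object names `r0`, `elltilde`, and `ai` are ours); the last line should
reproduce the `HC2` standard error that `dfadjustSE` reports for the
coefficient on `x1`.

```{r}
## Minimal sketch of the HC2 computation described above; not the package internals
r0 <- lm(y ~ x1, data = d1)
Q <- qr.Q(r0$qr)                        # n x p matrix with orthonormal columns
R <- qr.R(r0$qr)                        # upper triangular (ignoring column pivoting, which does not occur here)
elltilde <- drop(solve(t(R), c(0, 1)))  # R^{-1}' ell, with ell picking out the x1 coefficient
ai <- drop(Q %*% elltilde) / sqrt(1 - hatvalues(r0))  # a_i = A_i Q_i' elltilde
sqrt(sum((residuals(r0) * ai)^2))       # square root of hat{V}
```
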
For the case with clustering, computing an inverse of $I-Q_{s}Q_{s}'$ can be
expensive or even infeasible if the cluster size $n_{s}$ is large. We
therefore use the following result, which allows us to compute $a_{s}$ by
computing a spectral decomposition of a $p\times p$ matrix.

\begin{lemma}
Let $Q_{s}'Q_{s}=\sum_{i=1}^{p}\lambda_{is}r_{is}r_{is}'$ be the spectral
decomposition of $Q_{s}'Q_{s}$. Then $a_{s}=Q_{s}D_{s}\tilde{\ell}$, where
$D_{s}=\sum_{i\colon \lambda_{is}\neq 1}(1-\lambda_{is})^{-1/2}r_{is}r_{is}'$.
\end{lemma}

The lemma follows from the fact that $I-Q_{s}Q_{s}'$ has eigenvalues
$1-\lambda_{is}$ and eigenvectors $Q_{s}r_{is}$. More precisely, let
$Q_{s}=\sum_{i}\lambda_{is}^{1/2}u_{is}r_{is}'$ denote the singular value
decomposition of $Q_{s}$, so that
$I-Q_{s}Q_{s}'=\sum_{i}(1-\lambda_{is})u_{is}u_{is}'$, and we can take
$A_{s}=\sum_{i\colon \lambda_{is}\neq 1}(1-\lambda_{is})^{-1/2}u_{is}u_{is}'$.
Then,
\begin{equation*}
A_{s}'Q_{s}=\sum_{i\colon \lambda_{is}\neq 1}
(1-\lambda_{is})^{-1/2}\lambda_{is}^{1/2} u_{is} r_{is}'
=Q_{s}\sum_{i\colon \lambda_{is}\neq 1}(1-\lambda_{is})^{-1/2}r_{is} r_{is}',
\end{equation*}
where the second equality uses $Q_{s}r_{is}=\lambda_{is}^{1/2}u_{is}$.

## Degrees of freedom correction

Let $G$ be an $n\times S$ matrix with columns $(I-QQ')_{s}'a_{s}$. Then the
@BeMc02 adjustment sets the degrees of freedom to
\begin{equation*}
f_{\text{BM}}=\frac{\trace(G' G)^{2}}{\trace((G' G)^{2})}.
\end{equation*}
Since
$(G' G)_{st}=a_{s}'(I-QQ')_{s}(I-QQ')_{t}'a_{t}
=a_{s}'(\1{s=t}I-Q_{s}Q_{t}')a_{t}$, the matrix $G' G$ can be efficiently
computed as
\begin{equation*}
G' G=\diag(a_{s}'a_{s})-BB',\qquad B_{s k}=a_{s}'Q_{s k}.
\end{equation*}
Note that $B$ is an $S\times p$ matrix, so that computing the degrees of
freedom adjustment only involves $p\times p$ matrices:
\begin{align*}
f_{\text{BM}}=\frac{(\sum_{s}a_{s}'a_{s}-\sum_{s,k}B_{s k}^{2})^{2}}{
\sum_{s}(a_{s}'a_{s})^{2}-2\sum_{s,k}(a_{s}'a_{s})B_{s k}^{2}+\sum_{s,t}(B_{s}'B_{t})^{2}}.
\end{align*}
If the observations are independent, we compute $B$ directly as `B <- a*Q`,
and since $a_{i}$ is a scalar, we have
\begin{equation*}
f_{\text{BM}}=\frac{(\sum_{i}a_{i}^{2}-\sum_{i,k}B_{i k}^{2})^{2}}{
\sum_{i}a_{i}^{4}-2\sum_{i}a_{i}^{2}B_{i}'B_{i}+\sum_{i, j}(B_{i}'B_{j})^{2}}.
\end{equation*}

The @ImKo16 degrees of freedom adjustment instead sets
\begin{equation*}
f_{\text{IK}}=\frac{\trace({G}'\hat{\Omega} G)^{2}}{\trace(({G}'\hat{\Omega} G)^{2})},
\end{equation*}
where $\hat{\Omega}$ is an estimate of $\Omega$ based on the @moulton86 model
of the covariance matrix, under which
$\Omega_{s}=\sigma_{\epsilon}^{2}I_{n_{s}}+\rho\iota_{n_{s}}\iota_{n_{s}}'$.
Using simple algebra, one can show that in this case,
\begin{equation*}
G'\Omega G=\sigma_{\epsilon}^{2}\diag(a_{s}'a_{s})
-\sigma_{\epsilon}^{2}BB'+\rho (D-BF')(D-BF')',
\end{equation*}
where
\begin{equation*}
F_{s k}=\iota_{n_{s}}'Q_{s k},\qquad D=\diag(a_{s}'\iota_{n_{s}}),
\end{equation*}
which can again be computed even if the clusters are large. The estimate
$\hat{\Omega}$ replaces $\sigma_{\epsilon}^{2}$ and $\rho$ with analog
estimates.
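As a check on these formulas, the following sketch (again, not the package's
implementation) computes the Bell-McCaffrey degrees of freedom for the
`y ~ x2` example by constructing $G$ column by column, with $a_{s}$ obtained
via Lemma 2; the result should agree with the `df` reported for `x2` in the
Bell-McCaffrey output above.

```{r}
## Brute-force Bell-McCaffrey df for the y ~ x2 example; not the package internals
r0 <- lm(y ~ x2, data = d1)
Q <- qr.Q(r0$qr)
R <- qr.R(r0$qr)
elltilde <- drop(solve(t(R), c(0, 1)))
G <- matrix(0, nrow(d1), nlevels(d1$cl))
for (s in seq_len(nlevels(d1$cl))) {
    idx <- d1$cl == levels(d1$cl)[s]
    Qs <- Q[idx, , drop = FALSE]
    ev <- eigen(crossprod(Qs), symmetric = TRUE)   # spectral decomposition of Q_s'Q_s
    keep <- ev$values < 1 - 1e-7                   # drop eigenvalues equal to one
    Ds <- ev$vectors[, keep, drop = FALSE] %*%
        ((1 - ev$values[keep])^(-1/2) * t(ev$vectors[, keep, drop = FALSE]))
    a_s <- drop(Qs %*% Ds %*% elltilde)            # a_s = Q_s D_s elltilde (Lemma 2)
    gs <- numeric(nrow(d1))
    gs[idx] <- a_s
    G[, s] <- gs - drop(Q %*% crossprod(Qs, a_s))  # column s of G: (I - QQ')_s' a_s
}
GG <- crossprod(G)
sum(diag(GG))^2 / sum(GG^2)                        # f_BM = tr(G'G)^2 / tr((G'G)^2)
```
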
```{r cleanup, include=FALSE}
options(oldoptions)
```

# Proof of Lemma 1

The estimator of the block of $V$ associated with $W$ implied by $\hat{V}$ is
given by
\begin{equation*}
(\ddot{W}'\ddot{W})^{-1}\sum_{s}\ddot{W}_{s}'A_{s}(I-QQ')_{s} u u'
(I-QQ')_{s}'A_{s}'\ddot{W}_{s}(\ddot{W}'\ddot{W})^{-1},
\end{equation*}
which is unbiased under homoskedasticity if for each $s$,
\begin{equation}\label{equation:1}
\ddot{W}_{s}'A_{s}(I-Q_{s}Q_{s}')A_{s}'\ddot{W}_{s}=\ddot{W}_{s}'\ddot{W}_{s}.
\end{equation}
We will show that \eqref{equation:1} holds. To this end, we first claim that
under conditions (i) and (ii), $\ddot{W}_{s}$ is in the column space of
$I-Q_{s}Q_{s}'$ (a claim that's trivial if this matrix is full rank).
Decompose $I-QQ'=I-H_{\ddot{W}}-H_{L}$, where $H_{\ddot{W}}$ and $H_{L}$ are
the hat matrices associated with $\ddot{W}$ and $L$. The block associated with
cluster $s$ can thus be written as
$I-Q_{s}Q_{s}'=I-L_{s}(L' L)^{-1}L_{s}'-\ddot{W}_{s}(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}'$.
Let
$B_{s}=\ddot{W}_{s}(\ddot{W}'\ddot{W}-\ddot{W}_{s}'\ddot{W}_{s})^{-1}\ddot{W}'\ddot{W}$,
which is well-defined under condition (ii). Then, using condition (i), we get
\begin{equation*}
\begin{split}
(I-Q_{s}Q_{s}')B_{s}
&=(I-\ddot{W}_{s}(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}')B_{s}\\
&=\ddot{W}_{s}(I-(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}'\ddot{W}_{s})
(\ddot{W}'\ddot{W}-\ddot{W}_{s}'\ddot{W}_{s})^{-1}\ddot{W}'\ddot{W}=\ddot{W}_{s},
\end{split}
\end{equation*}
proving the claim. Letting $C$ denote the symmetric square root of
$I-Q_{s}Q_{s}'$, the left-hand side of \eqref{equation:1} can therefore be
written as
\begin{equation*}
\ddot{W}_{s}'A_{s}(I-Q_{s}Q_{s}')A_{s}'\ddot{W}_{s}
=B_{s}'CC A_{s} CC A_{s}' CC B_{s}=\ddot{W}_{s}'\ddot{W}_{s},
\end{equation*}
where the second equality follows by the definition of a generalized inverse.

# References