The package “sketching” is an R package that provides a variety of
random sketching methods via random subspace embeddings Researchers may
perform regressions using a sketch of data of size *m* instead of
the full sample of size *n* for a variety of reasons. Lee and Ng
(2022) considers the case when the regression errors do not have
constant variance and heteroskedasticity robust standard errors would
normally be needed for test statistics to provide accurate inference. It
is shown in Lee and Ng (2022) that estimates using data sketched by
random projections will behave *as if* the errors were
homoskedastic. Estimation by random sampling would not have this
property.

For more details, see the following papers.

- Lee, S. and Ng, S. (2022). “Least Squares Estimation Using Sketched Data with Heteroskedastic Errors,” arXiv:2007.07781, accepted for presentation at the Thirty-ninth International Conference on Machine Learning (ICML 2022).
- Lee, S. and Ng, S. (2020). “An Econometric Perspective on Algorithmic Subsampling,” Annual Review of Economics, 12(1): 45–80.

You can install the released version of sketching from CRAN with:

`install.packages("sketching")`

Alternatively, you can install the development version from GitHub with:

```
# install.packages("devtools") if devtools is not installed
::install_github("https://github.com/sokbae/sketching") devtools
```

We begin by calling the sketching package and fix the seed for reproducibility.

```
library(sketching)
<- 220526
seed set.seed(seed)
```

To illustrate the usefulness of the package, we use the well-known
Angrist and Krueger (1991) dataset. A particular extract of their
dataset is included in the package. Specifically, we look at the
ordinary least squares (OLS) and two stage least squares (2SLS)
estimates of the return to education in columns (1) and (2) of Table IV
in their paper. The dependent variable *y* is the log weekly
wages, the covariates *X* include years of education, the
intercept term and 9 year-of-birth dummies (*p*=11). Following
Angrist and Krueger (1991), the instruments *Z* are a full set of
quarter-of-birth times year-of-birth interactions (*q*=30). Their
idea was that season of birth is unlikely to be correlated with workers’
ability but can affect educational attainment because of compulsory
schooling laws. The full sample size is *n* = 247, 199.

We now define the variables accordingly.

```
<- AK$LWKLYWGE
Y <- AK$CNST
intercept <- AK$EDUC
X_end <- AK[,3:11]
X_exg <- cbind(X_exg, X_end)
X <- AK[,12:(ncol(AK)-1)]
Z_inst <- cbind(X_exg, Z_inst)
Z <- cbind(Y,intercept,X)
fullsample <- nrow(fullsample)
n <- ncol(X) d
```

We start with how to choose *m* in this application. Lee and
Ng (2020) highlights the tension between a large *m* required for
accurate inference, and a small *m* for computation efficiency.
From the algorithmic perspective, *m* needs to be chosen as small
as possible to achieve computational efficiency. However, statistical
analysis often cares about the variability of the estimates in repeated
sampling and a larger *m* may be desirable from the perspective
of statistical efficiency. An *inference-conscious* guide
*m*_{2} can be obtained as in Lee and Ng (2020) by
targeting the power at *γ̄* of a one-sided *t*-test for
given nominal size. Alternatively, a data-oblivious sketch size
*m*_{3} for a pre-specified *τ*_{2}(∞) can
be used.

We focus on the data-oblivious sketch size *m*_{3}, as
it is simpler to use. We set the target size *α* = 0.05 and the
target power *γ* = 0.8. Then,
*S*^{2}(*ᾱ*,*γ̄*) = 6.18. It remains to
specify *τ*_{2}(∞), which can be interpreted as the value
of *t*-statistic when the sample size is really large.

For OLS, we take *τ*_{2}(∞) = 10, resulting in
*m* = 15, 283 (about 6% of *n*).

```
# choice of m (data-oblivious sketch size)
<- 0.05
target_size <- 0.8
target_power <- (stats::qnorm(1-target_size) + stats::qnorm(target_power))^2
S_constant <- 10
tau_limit <- floor(n*S_constant/tau_limit^2)
m_ols print(m_ols)
#> [1] 15283
```

As a benchmark, we first obtain the OLS estimate using the full sample.

```
<- fullsample[,1]
ys <- as.matrix(fullsample[,-1])
reg <- lm(ys ~ reg - 1)
fullmodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(fullmodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.0801594610 0.0003552066
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(fullmodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.0801594610 0.0003946747
```

We now obtain the OLS estimates using a Bernoulli subsampling.

```
<- sketch(fullsample, m_ols, method = "bernoulli")
subsample <- subsample[,1]
ys <- subsample[,-1]
reg <- lm(ys ~ reg - 1)
submodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(submodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.080631991 0.001443075
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(submodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.080631991 0.001603487
```

As another example of random sampling, we now consider uniform sampling.

```
<- sketch(fullsample, m_ols, method = "unif")
subsample <- subsample[,1]
ys <- subsample[,-1]
reg <- lm(ys ~ reg - 1)
submodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(submodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.078158594 0.001467223
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(submodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.078158594 0.001630221
```

We now move to random projection schemes. First, we consider countsketch.

```
<- sketch(fullsample, m_ols, method = "countsketch")
subsample <- subsample[,1]
ys <- subsample[,-1]
reg <- lm(ys ~ reg - 1)
submodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(submodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.07818373 0.00142449
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(submodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.07818373 0.00146307
```

Next, we consider Subsampled Randomized Hadamard Transform (SRHT).

```
<- sketch(fullsample, m_ols, method = "srht")
subsample <- subsample[,1]
ys <- subsample[,-1]
reg <- lm(ys ~ reg - 1)
submodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(submodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.077694934 0.001417144
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(submodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.077694934 0.001420585
```

For each sketching scheme, only one random sketch is drawn; hence, the results can change if we redraw sketches. Note that the intercept term is included in the full sample data before applying sketching methods. This is important for random sketching schemes as the observations across different rows are randomly combined.

Remarkably, all sketched estimates are 0.08, reproducing the full sample estimate up to the second digit. The sketched homoskedasticity-only standard errors are also very much the same across different methods. The Eicker-Huber-White standard error (i.e., heteroskedasticity-robust standard error) is a bit larger than the homoskedastic standard error with the full sample. As expected, the same pattern is observed for Bernoulli and uniform sampling, as these sampling schemes preserve conditional heteroskedasticity.

We now move to 2SLS estimation. For 2SLS, as it is more demanding to
achieve good precision, we take *τ*_{2}(∞) = 5, resulting
in *m* = 61, 132 (about 25% of *n*).

```
<- cbind(Y,intercept,X,intercept,Z)
fullsample <- nrow(fullsample)
n <- ncol(X)
p <- ncol(Z)
q # choice of m (data-oblivious sketch size)
<- 0.05
target_size <- 0.8
target_power <- (qnorm(1-target_size) + qnorm(target_power))^2
S_constant <- 5
tau_limit <- floor(n*S_constant/tau_limit^2)
m_2sls print(m_2sls)
#> [1] 61132
```

As before, we first obtain the 2SLS estimate using the full sample.

```
<- fullsample[,1]
ys <- as.matrix(fullsample[,2:(p+2)])
reg <- as.matrix(fullsample[,(p+3):ncol(fullsample)])
inst <- ivreg::ivreg(ys ~ reg - 1 | inst - 1)
fullmodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(fullmodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se print(c(est,se))
#> [1] 0.07685568 0.01504165
# use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(fullmodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc print(c(est_hc,se_hc))
#> [1] 0.07685568 0.01512252
```

The 2SLS estimate of the return to education is 0.769 and both types of standard errors are almost the same.

We now consider a variety of sketching schemes.

```
# sketching methods for 2SLS
<- c("bernoulli","unif","countsketch","srht")
methods <- array(NA, dim = c(length(methods),3))
results_2sls for (met in 1:length(methods)){
<- methods[met]
method # generate a sketch
<- sketch(fullsample, m_2sls, method = method)
subsample <- subsample[,1]
ys <- as.matrix(subsample[,2:(p+2)])
reg <- as.matrix(subsample[,(p+3):ncol(subsample)])
inst <- ivreg::ivreg(ys ~ reg - 1 | inst - 1)
submodel # use homoskedasticity-only asymptotic variance
<- lmtest::coeftest(submodel, df = Inf)
ztest <- ztest[(d+1),1]
est <- ztest[(d+1),2]
se # use heteroskedasticity-robust asymptotic variance
<- lmtest::coeftest(submodel, df = Inf,
ztest_hc vcov = sandwich::vcovHC, type = "HC0")
<- ztest_hc[(d+1),1]
est_hc <- ztest_hc[(d+1),2]
se_hc <- c(est, se, se_hc)
results_2sls[met,]
}rownames(results_2sls) <- methods
colnames(results_2sls) <- c("est", "non-robust se","robust se")
print(results_2sls)
#> est non-robust se robust se
#> bernoulli 0.08090466 0.02358229 0.02362023
#> unif 0.05960437 0.02412623 0.02467241
#> countsketch 0.10289271 0.02297033 0.02615567
#> srht 0.08491405 0.02403900 0.02412490
```

The sketched 2SLS estimates vary more than the sketched OLS estimates, reflecting that the 2SLS estimates are less precisely estimated than the OLS estimates. As in the full sample case, both types of standard errors are similar across all sketches for 2SLS.

Angrist, J. D., and A. B. Krueger (1991). “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics 106, no. 4 (1991): 979–1014. https://doi.org/10.2307/2937954.

Lee, S. and Ng, S. (2022). “Least Squares Estimation Using Sketched Data with Heteroskedastic Errors,” arXiv:2007.07781, accepted for presentation at the Thirty-ninth International Conference on Machine Learning (ICML 2022).

Lee, S. and Ng, S. (2020). “An Econometric Perspective on Algorithmic Subsampling,” Annual Review of Economics, 12(1): 45–80. https://doi.org/10.1146/annurev-economics-022720-114138