**Summary:** “Introduction to stratamatch” covers the
basic functionality of `stratamatch`

: stratifying a data set,
assessing the quality of the match, and matching the data within strata.
This vignette features more advanced functionality of
`stratamatch`

.

This vignette contains:

- Set-up
- Splitting the pilot set
- Fitting the prognostic model
- Alternative matching schemes

We’ll start with some sample data:

```
library(stratamatch)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
<- make_sample_data(n = 5000) mydata
```

An important consideration for pilot designs like the
`stratamatch`

approach is the selection of the pilot set.
Ideally, the individuals in the pilot set should be similarto the
individuals in the treatment group, so a prognostic model built on this
pilot set will not beextrapolating heavily when estimating prognostic
scores on the analysis set. To more closely ensure that the selected
pilot set is a representative sample of controls, one easy step is to
specify a list of categorical or binary covariates and sample the pilot
set proportionally based on these covariates.

This can be done in one step using `auto_stratify`

, for
example:

```
<- auto_stratify(mydata, "treat",
a.strat1 prognosis = outcome ~ X1 + X2,
group_by_covariates = c("C1", "B2"), size = 500)
#> Constructing a pilot set by subsampling 10% of controls.
#> Subsampling while balancing on:
#> C1, B2
#> Fitting prognostic model via logistic regression: outcome ~ X1 + X2
```

Another method is to use the `split_pilot_set`

function,
which allows the user to split the pilot set (and examine the balance)
before passing the result to `auto_stratify`

to fit the
prognostic score.

```
<- split_pilot_set(mydata, "treat", group_by_covariates = c("C1", "B2"))
mysplit #> Constructing a pilot set by subsampling 10% of controls.
#> Subsampling while balancing on:
#> C1, B2
```

The result, `mysplit`

, is a list containing a
`pilot_set`

and an `analysis_set`

, in this case
partitioned while balancing on B1 and C2. At this point, we might pass
the result to `auto_stratify`

as follows:

```
<- auto_stratify(mysplit$analysis_set, "treat",
a.strat2 prognosis = outcome ~ X1 + X2,
pilot_sample = mysplit$pilot_set, size = 500)
#> Using user-specified set for prognostic score modeling.
#> Fitting prognostic model via logistic regression: outcome ~ X1 + X2
```

In this case, the pilot set splitting method is the same for
`a.strat1`

and `a.strat2`

, so each should be
qualitatively similar.

By default, `auto_stratify`

uses a logistic regression
(for binary outcomes) or a linear regression (for continuous outcomes)
to fit the prognostic model. `auto_stratify`

is built to
accomodate other prognostic score estimation approaches: rather than
letting `auto_stratify`

do the model fitting, the user can
specify a pre-fit model or a vector of prognostic scores.

A word of caution is advisable: Logistic regression is generally the
norm when fitting propensity scores (Stuart
(2010)), and most studies discussing the prognostic score have
likewise focused on linear or logistic regression (Ben B. Hansen (2008), Leacy and Stuart (2014), Aikens et al. (2020)), or less commonly the
lasso (Antonelli et al. (2018)). The
nuances of modeling and checking the fit of the prognostic score are
still understudied. In particular, prognostic models generally are fit
on only control observations, meaning that they must necessarily
extrapolate to the treatment group. Users should, as always, consider
diagnostics for their specific model, for their stratified data set (see
“Intro to statamatch”), and for their matched dataset (see, as an
introduction Stuart (2010)). Additionally,
in order to maintain the integrity of the pilot set and prevent
over-fitting, users interested in trying many modeling or matching
schemes are encouraged to define a single pilot/analysis set split
(e.g. with `split_pilot_set`

, above) and use the same pilot
set throughout their design process.

To an extent, any prognostic score stratification – even on a poor quality model – will increase the speed of matching for large datasets. In addition, if the strata are sufficiently large, the subsequent within-strata matching step may compensate for a poor-quality prognostic model. To one view, the diagnostic check that matters most is the quality of the final matching result, whereas specific prognostic modeling concerns are perhaps secondary. Nonetheless, simulation and theory results suggest that incorporating a prognostic score into a study design (in combination, generally, with a propensity score) can have favorable statistical properties, such as decreasing variance, increasing power in gamma sensitivity analyses, and decreasing dependence on the propensity score (Stuart (2010), Aikens, Greaves, and Baiocchi (2020), Antonelli et al. (2018), Ben B. Hansen (2008)). To that end, prognostic models which produce high-quality prognostic score estimates are expected to ultimately produce higher quality designs by improving the prognostic balance of the matched set.

This section contains a few examples to introduce users who may be
new to the modeling space. By no means does it begin to cover all of the
modeling possibilities, or even all of the nuances of any one model.
This is not a tutorial on predictive modeling. Users looking to become
more familiar with modeling in R more broadly may be interested in the
`caret`

package (Kuhn (2021)), which implements
support for a wide variety of predictive models.

It’s important to select a model which is appropriate to the nature of the outcome of interest. In this tutorial, our sample data has a binary outcome, so we use models appropriate to that outcome. Users with continuous outcomes should use regression models appropriate to continuous outcomes. Other types of outcomes – such as categorical – have not yet been characterized in the prognostic score literature.

The lasso is a sparsifying linear model – it is a mathematical cousin to linear regression, but it functions well when a substaintial number of the measured covariates are actually uninformative to the outcome. This may be a particularly useful and intuitive approach when there are many measured covariates which may be redundant or uninformative.

The code below uses the `glmnet`

package (Friedman, Hastie, and Tibshirani (2010)) to fit
a cross-validated lasso on the pilot set based on all the measured
covariates. In this example, since the outcome is binary, we will run a
logistic lasso. This is done by specifying the
`family = "binomial"`

argument to `cv.glmnet`

(although other modeling steps are simular for continuous outcomes.)

The code below does some preprocessing to convert the pilot set data
to the right format before passing it to `cv.glmnet`

.
`glmnet`

expects the input data to be a model matrix rather
than a data frame, and it expects outcomes (`y_pilot`

) to be
separated from the covariate data (`x_pilot`

).

```
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-3
# fit model on pilot set
<- model.matrix(outcome ~ X1 + X2 + B1 + B2 + C1,
x_pilot data = mysplit$pilot_set)
<- mysplit$pilot_set %>%
y_pilot ::select(outcome) %>%
dplyras.matrix()
<- cv.glmnet(x_pilot, y_pilot, family = "binomial") cvlasso
```

At this point, we can run diagnostics on `cvlasso`

. The Introduction to
glmnet vignette contains an accessible starting point. In this
simulated data, we happen to know that the only variable that actually
affects the outcome is `X1`

. The sparsifying model often does
a great job of picking out `X1`

as the most important
variable. We can see this by printing the coefficients:

```
coef(cvlasso)
#> 8 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 0.02870109
#> (Intercept) .
#> X1 -0.64374310
#> X2 .
#> B1 .
#> B2 .
#> C1b .
#> C1c .
```

When we are satisfied with our prognostic model, we can estimate the
scores on the analysis set with `predict`

and pass the result
to `auto_stratify`

.

```
# estimate scores on analysis set
<- model.matrix(outcome ~ X1 + X2 + B1 + B2 + C1,
x_analysis data = mysplit$analysis_set)
<- predict(cvlasso, newx = x_analysis, s = "lambda.min", type = "response")
lasso_scores
# pass the scores to auto_stratify
<- auto_stratify(data = mysplit$analysis_set,
a.strat_lasso treat = "treat",
outcome = "outcome",
prognosis = lasso_scores,
pilot_sample = mysplit$pilot_set,
size = 500)
```

An elastic net is a fairly straightforward extension of the code
above – we can use the same form of the pilot set data that we used
above. An additional task is to select the `alpha`

“mixing”
parameter determining the amount of L2-regularization (1 for lasso, 0
for ridge regression). Here, I set `alpha = 0.2`

. The
tutorial for `glmnet`

also contains some advice for selecting
the alpha parameter via cross-validation. As above, we specify a
logistic elastic net with `family = "binomial"`

, since in
this case our outcome is binary.

```
<- cv.glmnet(x_pilot, y_pilot, family = "binomial", alpha = 0.2)
cvenet
<- predict(cvenet, newx = x_analysis, s = "lambda.min", type = "response")
enet_scores
# pass the scores to auto_stratify
<- auto_stratify(data = mysplit$analysis_set,
a.strat_enet treat = "treat",
outcome = "outcome",
prognosis = enet_scores,
pilot_sample = mysplit$pilot_set,
size = 500)
```

Random forests are a popular option for both classification and
regression modeling, particularly because of their strengths in modeling
nonlinear relationships in the data. Below is an example which fits a
random forest for our binary outcome using `randomForest`

. A
note for users with binary outcomes: `randomForest`

will run
regression by default if the outcome column is numeric. To circumvent
this (e.g. for 0/1 coded data), the outcome can be cast as a factor.

```
library(randomForest)
#> randomForest 4.7-1
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:dplyr':
#>
#> combine
<- randomForest(as.factor(outcome) ~ X1 + X2 + B1 + B2, data = mysplit$pilot_set) forest
```

Random forests can be somewhat more opaque than linear models in
terms of understanding how predictions are made. A good starting point
is running `importance(forest)`

to check on which features
are weighted heavily in the model.

Below, we extract a “prognostic score” from the random forest.
Another note for users with binary outcomes: The `predict`

method for random forest classifiers outputs 0/1 predictions by default.
These will be useless for stratification. Instead, we need to specify
`type = "prob"`

in the call to predict and extract the
probabilities for the “positive” outcome class.

```
<- predict(forest,
forest_scores newdata = mysplit$analysis_set,
type = "prob")[,1]
<- auto_stratify(data = mysplit$analysis_set,
a.strat_forest treat = "treat",
outcome = "outcome",
prognosis = forest_scores,
pilot_sample = mysplit$pilot_set,
size = 500)
```

“Intro to Stratamatch” covers only the default functionality of
`stratamatch`

, which is a fixed 1:k propensity score match
within strata. This tutorial covers some alternative options. Note that
the examples that follow all require the R package `optmatch`

(Ben B. Hansen and Klopfer (2006)) to be
installed.

Users can opt to use Mahalanobis distance rather then propensity
score for the within-strata matching step by specifying the “method” to
`strata_match`

. We set `k = 2`

, so that precisely
2 control observations are matched to each treated observation.

```
<- strata_match(a.strat2, model = treat ~ X1 + X2 + B1 + B2,
mahalmatch method = "mahal", k = 2)
#> Using Mahalanobis distance: treat ~ X1 + X2 + B1 + B2
summary(mahalmatch)
#> Structure of matched sets:
#> 1:2 0:1
#> 1146 1176
#> Effective Sample Size: 1528
#> (equivalent number of matched pairs).
```

Full matching may be a particularly useful approach when the ratio of
treated to control individuals varies within strata, but the researcher
still would prefer to use as much of the data as possible. To do full
matching, set `k = "full"`

. This can be used in combination
with mahalanobis distance matching, as shown below:

```
<- strata_match(a.strat2, model = treat ~ X1 + X2 + B1 + B2,
fullmahalmatch method = "mahal", k = "full")
#> Using Mahalanobis distance: treat ~ X1 + X2 + B1 + B2
summary(fullmahalmatch)
#> Structure of matched sets:
#> 5+:1 4:1 3:1 2:1 1:1 1:2 1:3 1:4 1:5+
#> 8 6 14 42 453 132 84 70 206
#> Effective Sample Size: 1331.8
#> (equivalent number of matched pairs).
```

`stratamatch`

doesn’t natively support all possible
matching schemes. Luckily, it can be fairly straightforward to stratify
a data set with `auto_stratify`

and match the results with
other software. As an example, the code below uses the
`optmatch`

package (Ben B. Hansen and
Klopfer (2006)) to match within-strata using Mahalanobis distance
with a propensity score caliper.

```
library(optmatch)
#> Loading required package: survival
# mahalanobis distance matrix for within-strata matching
<- match_on(treat ~ X1 + X2 + B1 + B2,
mahaldist within = exactMatch(treat ~ stratum,
data = a.strat2$analysis_set),
data = a.strat2$analysis_set)
# add propensity score caliper
<- match_on(glm(treat ~ X1 + X2 + B1 + B2,
propdist family = "binomial",
data = a.strat2$analysis_set))
<- mahaldist + caliper(propdist, width = 1)
mahalcaliper
<- pairmatch(mahalcaliper, data = a.strat2$analysis_set)
mahalcaliper_match #> Warning in fullmatch.BlockedInfinitySparseMatrix(x = x, min.controls = controls, : At least one subproblem matching failed.
#> (Restrictions impossible to meet?)
#> Enter ?matchfailed for more info.
summary(mahalcaliper_match)
#> Matches made in 8 of 10 subgroups, accounting for 3691 of 4614 total observations.
#>
#>
#> Structure of matched sets:
#> 1:0 1:1 0:1
#> 192 954 2514
#> Effective Sample Size: 954
#> (equivalent number of matched pairs).
```

Aikens, Rachael C, Dylan Greaves, and Michael Baiocchi. 2020. “A
Pilot Design for Observational Studies: Using Abundant Data
Thoughtfully.” *Statistics in Medicine*.

Aikens, Rachael C, Joseph Rigdon, Justin Lee, Michael Baiocchi, Andrew B
Goldstone, Peter Chiu, Y Joseph Woo, and Jonathan H Chen. 2020.
“Stratified Pilot Matching in r: The Stratamatch Package.”
*Statistics arXiv*, January. https://arxiv.org/abs/2001.02775.

Antonelli, Joseph, Matthew Cefalu, Nathan Palmer, and Denis Agniel.
2018. “Doubly Robust Matching Estimators for High Dimensional
Confounding Adjustment.” *Biometrics* 74 (4): 1171–79.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010.
“Regularization Paths for Generalized Linear Models via Coordinate
Descent.” *Journal of Statistical Software* 33 (1): 1.

Hansen, Ben B. 2008. “The Prognostic Analogue of the Propensity
Score.” *Biometrika* 95 (2): 481–88.

Hansen, Ben B., and Stephanie Olsen Klopfer. 2006. “Optimal Full
Matching and Related Designs via Network Flows.” *Journal of
Computational and Graphical Statistics* 15 (3): 609–27.

Kuhn, Max. 2021. *Caret: Classification and Regression Training*.
https://CRAN.R-project.org/package=caret.

Leacy, Finbarr P, and Elizabeth A Stuart. 2014. “On the Joint Use
of Propensity and Prognostic Scores in Estimation of the Average
Treatment Effect on the Treated: A Simulation Study.”
*Statistics in Medicine* 33 (20): 3488–508.

Stuart, Elizabeth A. 2010. “Matching Methods for Causal Inference:
A Review and a Look Forward.” *Statistical Science: A Review
Journal of the Institute of Mathematical Statistics* 25 (1): 1.