Usage of the Personalized Package

Jared Huling

2022-06-27

1 Introduction to personalized

The personalized package aims to provide an entire analysis pipeline that encompasses a broad class of statistical methods for subgroup identification / personalized medicine.

The general analysis pipeline is as follows:

  1. Construct propensity score function and check propensity score diagnostics
  2. Choose and fit a subgroup identification model
  3. Estimate the resulting treatment effects among estimated subgroups
  4. Visualize and examine model and subgroup treatment effects

The available subgroup identification models are those under the purview of the general subgroup identification framework proposed by Chen, et al. (2017). In this section we will give a brief summary of this framework and what elements of it are available in the personalized package.

In general we are interested in understanding the impact of a treatment on an outcome, and in particular in determining whether and how different patients respond differently to a treatment in terms of their expected outcome. Assume the outcome we observe \(Y\) is such that larger values are preferable. In addition to the outcome, we also observe patient covariate information \(X \in \mathbb{R}^p\) and the treatment status \(T \in \{-1,1\}\), where \(T = 1\) indicates that a patient received the treatment and \(T = -1\) indicates that a patient received the control. For the purposes of this package, we consider an unspecified form for the expected outcome conditional on the covariate and treatment status information: \[E(Y|T, X) = g(X) + T\Delta(X)/2,\] where \(\Delta(X) \equiv E(Y|T=1, X) - E(Y|T=-1, X)\) is of primary interest and \(g(X) \equiv \frac{1}{2}\{E(Y|T=1, X) + E(Y|T=-1, X) \}\) represents the covariate main effects. Here, \(\Delta(X)\) represents the interaction between treatment and covariates and thus drives heterogeneity of the treatment effect. The purpose of the package is the estimation of \(\Delta(X)\), or monotone transformations of \(\Delta(X)\), which can be used to stratify the population into subgroups (e.g. a subgroup of patients who benefit from the treatment and a subgroup who do not).
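As a quick check of this decomposition, plugging \(T = 1\) and \(T = -1\) into the model recovers the two definitions above:

\[
E(Y|T=1, X) - E(Y|T=-1, X) = \{g(X) + \Delta(X)/2\} - \{g(X) - \Delta(X)/2\} = \Delta(X),
\]

\[
\tfrac{1}{2}\{E(Y|T=1, X) + E(Y|T=-1, X)\} = \tfrac{1}{2}\{2g(X)\} = g(X).
\]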

We call the term \(\Delta(X)\) a benefit score, as it reflects how much a patient is expected to benefit from a treatment in terms of their outcome. For a patient with \(X = x\), if \(\Delta(x) > 0\) (assuming larger outcomes are better), the treatment is beneficial in terms of the expected outcome, and if \(\Delta(x) \leq 0\), the control is at least as good as the treatment. Hence, to identify which subgroup of patients benefits from a treatment, we seek to estimate \(\Delta(X)\).

In the framework of Chen, et al. (2017), there are two main methods for estimating subgroups. The first is the weighting method, which estimates \(\Delta(X)\) (or a monotone transformation of it) by minimizing the following objective function with respect to \(f(X)\): \[\begin{equation} \label{eqn:weighting} L_W(f) = \frac{1}{n}\sum_{i = 1}^n\frac{M(Y_i, T_i\times f(x_i)) }{ {T_i\pi(x_i)+(1-T_i)/2} }, \end{equation}\] where \(\pi(x) = Pr(T = 1|X = x)\) is the propensity score function. The minimizer \(\hat{f} = \mbox{argmin}_f L_W(f)\) is the estimated benefit score, i.e. our estimate of \(\Delta(X)\). If we want a simple functional form for the estimate, we can restrict \(f\) to be a linear combination of the covariates, i.e. \(f(X) = X^T\beta\), in which case \(\hat{f}(X) = X^T\hat{\beta}\).

The second is the A-learning method. The A-learning estimator is the minimizer of \[L_A(f) = \frac{1}{n}\sum_{i = 1}^n M\bigl(Y_i, \{(T_i+1)/2 -\pi(x_i)\}\times f(x_i)\bigr).\]
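To make the weighting objective concrete, the following sketch evaluates \(L_W\) for the squared error loss and a linear benefit score \(f(X) = X^T\beta\). The helper function is illustrative only and is not part of the personalized package; the A-learning objective would drop the weights and instead use the multiplier \(\{(T_i+1)/2 -\pi(x_i)\}\) inside \(M\).

# illustrative only: evaluate the weighted squared-error objective L_W
# for a candidate linear benefit score f(X) = X beta.
# trt is coded -1/1 and pi.x holds propensity scores Pr(T = 1 | X = x)
weighted_sq_loss <- function(beta, x, y, trt, pi.x)
{
    f.x <- drop(x %*% beta)                    # benefit scores f(x_i)
    M   <- (trt * f.x - y) ^ 2                 # M(Y_i, T_i * f(x_i)) for squared error
    w   <- 1 / (trt * pi.x + (1 - trt) / 2)    # inverse propensity weights
    mean(w * M)                                # L_W(f)
}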

1.0.1 Choice of \(M\) function

The personalized package offers a flexible range of choices both for the form of \(f(X)\) and for the loss function \(M(y, v)\). Most choices of \(f\) and \(M\) can be used with either the weighting method or the A-learning method. In this package, we limit the use of \(M\) to natural choices corresponding to the type of outcome: the squared error loss \(M(y, v) = (v - y) ^ 2\) corresponds to continuous responses (but can also be used for binary outcomes), the logistic loss corresponds to binary outcomes, and the loss associated with the negative partial likelihood of the Cox proportional hazards model corresponds to time-to-event outcomes. The available losses are summarized in the table below.

| Name              | Outcomes | Loss                                                                 |
|-------------------|----------|----------------------------------------------------------------------|
| Squared Error     | C/B/CT   | \(M(y, v) = (v - y) ^ 2\)                                            |
| OWL Logistic      | C/B/CT   | \(M(y, v) = y\log(1 + \exp\{-v\})\)                                  |
| OWL Logistic Flip | C/B/CT   | \(M(y, v) = \vert y\vert \log(1 + \exp\{-\mbox{sign}(y)v\})\)        |
| OWL Hinge         | C/B/CT   | \(M(y, v) = y\max(0, 1 - v)\)                                        |
| OWL Hinge Flip    | C/B/CT   | \(M(y, v) = \vert y\vert\max(0, 1 - \mbox{sign}(y)v)\)               |
| Logistic          | B        | \(M(y, v) = -[yv - \log(1 + \exp\{v\})]\)                            |
| Poisson           | CT       | \(M(y, v) = -[yv - \exp(v)]\)                                        |
| Cox               | TTE      | \(M(y, v) = -\left\{ \int_0^\tau\left( v - \log[E\{ e^vI(X \geq u) \}] \right)\mathrm{d} N(u) \right\}\) |

where “C” indicates usage for continuous outcomes, “B” indicates usage for binary outcomes, “CT” indicates usage for count outcomes, and “TTE” indicates usage for time-to-event outcomes. For the last loss, \(y = (X, \delta) = \{ \widetilde{X} \wedge C, I(\widetilde{X} \leq C) \}\), where \(\widetilde{X}\) is the survival time, \(C\) is the censoring time, \(N(t) = I(\widetilde{X} \leq t)\delta\), and \(\tau\) is a fixed time point such that \(P(X \geq \tau) > 0\).
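As a small illustration (not package code), the first two losses in the table can be written directly as R functions of \((y, v)\):

# illustrative definitions of two losses from the table above (not package code)
sq_loss      <- function(y, v) (v - y) ^ 2            # squared error loss
owl_logistic <- function(y, v) y * log(1 + exp(-v))   # OWL logistic loss

# evaluate both at an arbitrary point, e.g. y = 2, v = 0.5
sq_loss(2, 0.5)
owl_logistic(2, 0.5)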

1.0.2 Choice of \(f\)

The choices of \(f\) offered in the personalized package are varied. A familiar, interpretable choice of \(f(X)\) is the linear form \(X^T\beta\). Also offered is an additive model, i.e. \(f(X) = \sum_{j = 1}^pf_j(X_j)\); this option is accessed through the mgcv package, which provides estimation procedures for generalized additive models (GAMs). Another flexible but less interpretable choice is based on gradient boosted decision trees, which model \(f\) as \(f(X) = \sum_{k = 1}^Kf_k(X)\), where each \(f_k\) is a decision tree.

1.0.3 Variable Selection

For subgroup identification models with \(f(X) = X^T\beta\), the personalized package also allows for variable selection. Instead of minimizing \(L_W(f)\) or \(L_A(f)\) directly, we minimize a lasso-penalized version, \(L_W(f) + \lambda||\beta||_1\) or \(L_A(f) + \lambda||\beta||_1\), where \(\lambda\) is a tuning parameter controlling the amount of penalization.
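For example, combining the weighting method, the squared error loss, and a linear benefit score yields the penalized estimator used later in this vignette (up to the scaling conventions of the underlying glmnet routines):

\[
\hat{\boldsymbol\beta} = \mbox{argmin}_{\boldsymbol\beta}\; \frac{1}{n}\sum_{i = 1}^n\frac{(T_i x_i^T\boldsymbol\beta - Y_i)^2}{T_i\pi(x_i)+(1-T_i)/2} + \lambda||\boldsymbol\beta||_1,
\]

with \(\lambda\) chosen by cross-validation.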

1.1 Extension to multi-category treatments

Often, multiple treatment options are available to patients instead of a single treatment and a control, and the researcher may wish to understand which of the available treatments is best for which patients. Extending the above methodology to multi-category treatments introduces added complications; in particular, there is no straightforward extension of the A-learning method to multiple treatment settings. In the supplementary material of , the weighting method was extended to estimate a benefit score corresponding to each level of the treatment, subject to a sum-to-zero constraint for identifiability. In particular, we are interested in estimating (the sign of) \[\begin{eqnarray} \Delta_{kl}(x) \equiv \{ E(Y | T = k, { X} = { x}) - E(Y | T = l, X = { x}) \}. \label{definition of Delta_kl} \end{eqnarray}\] If \(\Delta_{kl}(x) > 0\), then treatment \(k\) is preferable to treatment \(l\) for a patient with \(X = x\). For each patient, evaluating all pairwise comparisons of the \(\Delta_{kl}(x)\) indicates which treatment leads to the largest expected outcome. The weighting estimators of the benefit scores are the minimizers of the following loss function: \[\begin{equation} \label{eqn:weighting_mult} L_W(f_1, \dots, f_{K}) = \frac{1}{n}\sum_{i = 1}^n\frac{M(Y_i, \sum_{k = 1}^{K}I(T_{i} = k)\times f_k(x_i) ) }{ { Pr(T = T_i | X = x_i)} }, \end{equation}\] subject to \(\sum_{k = 1}^{K}f_k(x_i) = 0\). Clearly when \(K = 2\), this loss function is equivalent to (\(\ref{eqn:weighting}\)).

Estimation of the benefit scores in this model is still challenging without added modeling assumptions, as enforcing \(\sum_{k = 1}^{K}f_k(x_i) = 0\) may not always be feasible using existing estimation routines. However, if each benefit score has a linear form, i.e. \(\Delta_{kl}(X) = X^\top\boldsymbol \beta_k\) where \(l\) denotes a reference treatment group, estimation can easily be cast into the same computational framework as the simpler two-treatment case by constructing an appropriate design matrix. Thus, for multiple treatments the package is restricted to linear estimators of the benefit scores. For illustration, consider a scenario with three treatment options, \(A\), \(B\), and \(C\). Let \(\boldsymbol X = ({\boldsymbol X}_A^\top, {\boldsymbol X}_B^\top, {\boldsymbol X}_C^\top )^\top\) be the design matrix for all patients, where each \({\boldsymbol X}_k\) is the sub-design matrix of patients who received treatment \(k\). Under \(\Delta_{kl}(X) = X^\top\boldsymbol \beta_k\) with \(l\) as the reference treatment, we can construct a new design matrix which can then be provided to existing estimation routines in order to minimize (\(\ref{eqn:weighting_mult}\)). With treatment \(C\) as the reference treatment, the design matrix is constructed as \[ \widetilde{{\boldsymbol X}} = \mbox{diag}(\boldsymbol J)\begin{pmatrix} {\boldsymbol X}_A & \boldsymbol 0 \\ \boldsymbol 0 & {\boldsymbol X}_B \\ {\boldsymbol X}_C & {\boldsymbol X}_C \end{pmatrix}, \] where the \(i\)th element of \(\boldsymbol J\) is \(2I(T_i \neq C) - 1\), and the weight vector \(\boldsymbol W\) has \(i\)th element \(1 / Pr(T = T_i | X = {x}_i)\). Furthermore, denote \(\widetilde{\boldsymbol \beta} = (\boldsymbol \beta_A^\top, \boldsymbol \beta_B^\top)^\top\). Then the elements of \(\widetilde{{\boldsymbol X}}\widetilde{\boldsymbol \beta}\) are \({x}_i^\top\boldsymbol \beta_A\) for patients who received \(A\), \({x}_i^\top\boldsymbol \beta_B\) for patients who received \(B\), and \(-{x}_i^\top(\boldsymbol \beta_A + \boldsymbol \beta_B)\) for patients who received \(C\), so the sum-to-zero constraint on the benefit scores holds by construction.
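The following base-R sketch (illustrative only, with small made-up sub-design matrices) shows how \(\widetilde{\boldsymbol X}\) and the sign vector \(\boldsymbol J\) could be assembled for three treatments with \(C\) as the reference:

# illustrative construction of the augmented design matrix for 3 treatments
set.seed(1)
p   <- 2                                    # number of covariates
x.A <- matrix(rnorm(5 * p), 5, p)           # hypothetical patients on treatment A
x.B <- matrix(rnorm(4 * p), 4, p)           # hypothetical patients on treatment B
x.C <- matrix(rnorm(6 * p), 6, p)           # hypothetical patients on treatment C

# block design matrix from the display above (rows ordered A, B, then C)
x.blocks <- rbind(cbind(x.A,                     matrix(0, nrow(x.A), p)),
                  cbind(matrix(0, nrow(x.B), p), x.B),
                  cbind(x.C,                     x.C))

# J_i = 2 * I(T_i != C) - 1: +1 for patients on A or B, -1 for patients on C
J <- rep(c(1, 1, -1), times = c(nrow(x.A), nrow(x.B), nrow(x.C)))

x.tilde <- diag(J) %*% x.blocks             # the augmented design matrix
dim(x.tilde)                                # n rows, 2p columns (beta_A then beta_B)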

2 Quick Usage Reference

First, we simulate some data for which we know the truth. In this simulation, the treatment assignment depends on covariates, so we must model the propensity score \(\pi(x) = Pr(T = 1 | X = x)\). We will also assume that larger values of the outcome are better.

library(personalized)

set.seed(123)
n.obs  <- 1000
n.vars <- 25
x <- matrix(rnorm(n.obs * n.vars, sd = 3), n.obs, n.vars)

# simulate non-randomized treatment
xbetat   <- 0.5 + 0.25 * x[,11] - 0.25 * x[,2]
trt.prob <- exp(xbetat) / (1 + exp(xbetat))
trt      <- rbinom(n.obs, 1, prob = trt.prob)

# simulate delta
delta <- (0.5 + x[,2] - 0.5 * x[,3] - 1 * x[,11] + 1 * x[,1] * x[,12] )

# simulate main effects g(X)
xbeta <- x[,1] + x[,11] - 2 * x[,12]^2 + x[,13] + 0.5 * x[,15] ^ 2
xbeta <- xbeta + delta * (2 * trt - 1)

# simulate continuous outcomes
y <- drop(xbeta) + rnorm(n.obs)
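A couple of optional sanity checks on the simulated data show how many patients received the treatment and what fraction of patients truly benefit from it:

# quick sanity checks of the simulated data (optional)
table(trt)          # number of patients receiving control (0) vs treatment (1)
mean(delta > 0)     # fraction of patients with a positive true treatment effect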

2.1 Creating and Checking Propensity Score Model

The first step in our analysis is to construct a model for the propensity score. In the personalized package, we need to wrap this model in a function that takes the covariate values and treatment statuses as inputs and returns the propensity scores (values between 0 and 1). Since there are many covariates, we use the lasso to select variables in our propensity score model:

# create function for fitting propensity score model
prop.func <- function(x, trt)
{
 # fit propensity score model
 propens.model <- cv.glmnet(y = trt,
                            x = x, 
                            family = "binomial")
 pi.x <- predict(propens.model, s = "lambda.min",
                 newx = x, type = "response")[,1]
 pi.x
}
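As an optional check (not required by the package), we can confirm that the function returns one propensity score per patient and that the values fall strictly between 0 and 1:

# quick numeric check of the fitted propensity scores (optional)
pi.hat <- prop.func(x, trt)
length(pi.hat)    # should equal n.obs
range(pi.hat)     # should lie strictly between 0 and 1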

We then need to make sure the propensity scores have sufficient overlap between treatment groups. We can do this with the check.overlap() function, which plots densities or histograms of the propensity scores for each of the treatment groups:

check.overlap(x, trt, prop.func)

We can see that the propensity scores mostly have common support, except for a small region near zero where there are no propensity scores for the treatment arm.
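Since check.overlap() can display either densities or histograms, the display can be switched; the argument name type used below is an assumption and should be verified against ?check.overlap:

# histogram version of the overlap plot (argument name assumed; see ?check.overlap)
check.overlap(x, trt, prop.func, type = "histogram")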

2.2 Fitting Subgroup Identification Models

The next step is to choose and fit a subgroup identification model. In this example the outcome is continuous, so we choose the squared error loss function. We also choose the model type (either the weighting or the A-learning method); the weighting method is the default. The main function for fitting subgroup identification models is fit.subgroup(). Since there are many covariates, we choose a loss function with a lasso penalty to select variables. The underlying fitting function here is cv.glmnet(), and we can pass arguments of cv.glmnet() to fit.subgroup(), such as nfolds for the number of cross-validation folds.

subgrp.model <- fit.subgroup(x = x, y = y,
                             trt = trt,
                             propensity.func = prop.func,
                             loss   = "sq_loss_lasso",
                             nfolds = 5)              # option for cv.glmnet

summary(subgrp.model)
## family:    gaussian 
## loss:      sq_loss_lasso 
## method:    weighting 
## cutpoint:  0 
## propensity 
## function:  propensity.func 
## 
## benefit score: f(x), 
## Trt recom = 1*I(f(x)>c)+0*I(f(x)<=c) where c is 'cutpoint'
## 
## Average Outcomes:
##                 Recommended 0      Recommended 1
## Received 0   -7.792 (n = 177) -19.5361 (n = 224)
## Received 1 -15.8839 (n = 437) -10.8195 (n = 162)
## 
## Treatment effects conditional on subgroups:
## Est of E[Y|T=0,Recom=0]-E[Y|T=/=0,Recom=0] 
##                           8.0919 (n = 614) 
## Est of E[Y|T=1,Recom=1]-E[Y|T=/=1,Recom=1] 
##                           8.7166 (n = 386) 
## 
## NOTE: The above average outcomes are biased estimates of
##       the expected outcomes conditional on subgroups. 
##       Use 'validate.subgroup()' to obtain unbiased estimates.
## 
## ---------------------------------------------------
## 
## Benefit score quantiles (f(X) for 1 vs 0): 
##      0%     25%     50%     75%    100% 
## -9.1348 -2.5640 -0.7204  0.8641  7.8583 
## 
## ---------------------------------------------------
## 
## Summary of individual treatment effects: 
## E[Y|T=1, X] - E[Y|T=0, X]
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -18.270  -5.128  -1.441  -1.526   1.728  15.717 
## 
## ---------------------------------------------------
## 
## 3 out of 25 interactions selected in total by the lasso (cross validation criterion).
## 
## The first estimate is the treatment main effect, which is always selected. 
## Any other variables selected represent treatment-covariate interactions.
## 
##             Trt1     V2    V11    V21
## Estimate -0.8104 0.7112 -0.353 0.2706

We can then plot the outcomes of patients in the different subgroups:

plot(subgrp.model)

Alternatively, we can create an interaction plot. This plot shows the average outcome within each subgroup broken down by treatment status. If the lines in the interaction plot cross, that is an indication of a differential treatment effect between the subgroups.

plot(subgrp.model, type = "interaction")

2.3 Evaluating Treatment Effects within Estimated Subgroups

Unfortunately, if we simply look at the average outcome within each subgroup, we obtain a biased estimate of the treatment effects within the subgroups because we have already used the data to estimate the subgroups. To get valid estimates of the subgroup treatment effects, we can instead use a bootstrap approach to correct for this bias, or we can repeatedly partition the data into training and testing samples; in the latter procedure, for each replication we fit a subgroup model on the training data and then evaluate the subgroup treatment effects on the testing data. The argument B specifies the number of replications and the argument train.fraction specifies the proportion of samples used for training in the training and testing partitioning method. Normally B would be set to a large number, such as 500 or 1000; a small value is used in the example below to keep it fast.

Both of these approaches can be carried out using the validate.subgroup() function.

validation <- validate.subgroup(subgrp.model, 
                                B = 10L,  # specify the number of replications
                                method = "training_test_replication",
                                train.fraction = 0.75)

validation
## family:  gaussian 
## loss:    sq_loss_lasso 
## method:  weighting 
## 
## validation method:  training_test_replication 
## cutpoint:           0 
## replications:       10 
## 
## benefit score: f(x), 
## Trt recom = 1*I(f(x)>c)+0*I(f(x)<=c) where c is 'cutpoint'
## 
## Average Test Set Outcomes:
##                                Recommended 0                  Recommended 1
## Received 0  -10.0324 (SE = 3.0877, n = 46.2) -15.6445 (SE = 3.9982, n = 56)
## Received 1 -15.3881 (SE = 2.3438, n = 105.8) -12.0415 (SE = 3.4534, n = 42)
## 
## Treatment effects conditional on subgroups:
## Est of E[Y|T=0,Recom=0]-E[Y|T=/=0,Recom=0] 
##              5.3557 (SE = 4.7726, n = 152) 
## Est of E[Y|T=1,Recom=1]-E[Y|T=/=1,Recom=1] 
##                3.603 (SE = 4.4692, n = 98) 
## 
## Est of 
## E[Y|Trt received = Trt recom] - E[Y|Trt received =/= Trt recom]:                     
## 4.1217 (SE = 3.2018)
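The bootstrap bias-correction approach mentioned above can be requested by changing the method argument; the method name used below is our best recollection of the package's option and should be checked against ?validate.subgroup:

# bootstrap bias-correction alternative (method name assumed; see ?validate.subgroup)
validation.boot <- validate.subgroup(subgrp.model,
                                     B = 10L,   # would normally be much larger
                                     method = "boot_bias_correction")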

We can then plot the average outcomes across all replications of the training and testing partition procedure:

plot(validation)

From the above plot we can evaluate the impact of the estimated subgroups. Among patients for whom the model recommends the control, those who instead receive the treatment have worse outcomes on average than those who receive the control. Similarly, among patients who are recommended the treatment, those who receive the treatment are better off on average than those who do not.

Similarly, we can create an interaction plot of either the bootstrap bias-corrected means within the different subgroups or the average test set means within subgroups. Here, lines crossing is an indicator of differential treatment effect between the subgroups.

plot(validation, type = "interaction")

We can also compare the validation results with the results on the observed data:

plotCompare(subgrp.model, validation, type = "interaction")
