1 Introduction

The purpose of this vignette is to provide a 15 minute quickstart guide to the core functions of the rbmi package.

The rbmi package consists of 4 core functions (plus several helper functions) which are typically called in sequence:

draws() - fits the imputation models and stores their parameters
impute() - creates multiple imputed datasets
analyse() - analyses each of the multiple imputed datasets
pool() - combines the analysis results across imputed datasets into a single statistic

This example in this vignette makes use of Bayesian multiple imputation; this functionality requires the installation of the suggested package rstan.

install.packages("rstan")

2 The Data

We use a publicly available example dataset from an antidepressant clinical trial of an active drug versus placebo. The relevant endpoint is the Hamilton 17-item depression rating scale (HAMD17) which was assessed at baseline and at weeks 1, 2, 4, and 6. Study drug discontinuation occurred in 24% of subjects from the active drug and 26% of subjects from placebo. All data after study drug discontinuation are missing and there is a single additional intermittent missing observation.

library(rbmi)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

data("antidepressant_data")
dat <- antidepressant_data

We consider an imputation model with the mean change from baseline in the HAMD17 score as the outcome (variable CHANGE in the dataset). The following covariates are included in the imputation model: the treatment group (THERAPY), the (categorical) visit (VISIT), treatment-by-visit interactions, the baseline HAMD17 score (BASVAL), and baseline HAMD17 score-by-visit interactions. A common unstructured covariance matrix structure is assumed for both groups. The analysis model is an ANCOVA model with the treatment group as the primary factor and adjustment for the baseline HAMD17 score.

rbmi expects its input dataset to be complete; that is, there must be one row per subject for each visit. Missing outcome values should be coded as NA, while missing covariate values are not allowed. If the dataset is incomplete, then the expand_locf() helper function can be used to add any missing rows, using LOCF imputation to carry forward the observed baseline covariate values to visits with missing outcomes. Rows corresponding to missing outcomes are not present in the antidepressant trial dataset. To address this we will therefore use the expand_locf() function as follows:


# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
    dat,
    PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
    VISIT = levels(dat$VISIT),
    vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
    group = c("PATIENT"),
    order = c("PATIENT", "VISIT")
)

3 Draws

The draws() function fits the imputation models and stores the corresponding parameter estimates or Bayesian posterior parameter draws. The three main inputs to the draws() function are:

data - The primary longitudinal data.frame containing the outcome variable and all covariates.
data_ice - A data.frame which specifies the first visit affected by an intercurrent event (ICE) and the imputation strategy for handling missing outcome data after the ICE. At most one ICE which is to be imputed by a non-MAR strategy is allowed per subject.
method - The statistical method used to fit the imputation models and to create imputed datasets.

For the antidepressant trial data, the dataset data_ice is not provided. However, it can be derived because, in this dataset, the subject’s first visit affected by the ICE “study drug discontinuation” corresponds to the first terminal missing observation. We first derive the dataset data_ice and then create 150 Bayesian posterior draws of the imputation model parameters.

For this example, we assume that the imputation strategy after the ICE is Jump To Reference (JR) for all subjects and that 150 multiple imputed datasets using Bayesian posterior draws from the imputation model are to be created.

# create data_ice and set the imputation strategy to JR for
# each patient with at least one missing observation
dat_ice <- dat %>% 
    arrange(PATIENT, VISIT) %>% 
    filter(is.na(CHANGE)) %>% 
    group_by(PATIENT) %>% 
    slice(1) %>%
    ungroup() %>% 
    select(PATIENT, VISIT) %>% 
    mutate(strategy = "JR")

# In this dataset, subject 3618 has an intermittent missing values which does not correspond
# to a study drug discontinuation. We therefore remove this subject from `dat_ice`. 
# (In the later imputation step, it will automatically be imputed under the default MAR assumption.)
dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618),]

dat_ice
#> # A tibble: 43 × 3
#>    PATIENT VISIT strategy
#>    <fct>   <fct> <chr>   
#>  1 1513    5     JR      
#>  2 1514    5     JR      
#>  3 1517    5     JR      
#>  4 1804    7     JR      
#>  5 2104    7     JR      
#>  6 2118    5     JR      
#>  7 2218    6     JR      
#>  8 2230    6     JR      
#>  9 2721    5     JR      
#> 10 2729    5     JR      
#> # ℹ 33 more rows

# Define the names of key variables in our dataset and
# the covariates included in the imputation model using `set_vars()`
# Note that the covariates argument can also include interaction terms
vars <- set_vars(
    outcome = "CHANGE",
    visit = "VISIT",
    subjid = "PATIENT",
    group = "THERAPY",
    covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
)

# Define which imputation method to use (here: Bayesian multiple imputation with 150 imputed datsets)
method <- method_bayes(
    n_samples = 150,
    control = control_bayes(
        warmup = 200,
        thin = 5,
        seed = 1821 # Seed to be used by Stan
    )
)

# Create samples for the imputation parameters by running the draws() function
set.seed(987)
drawObj <- draws(
    data = dat,
    data_ice = dat_ice,
    vars = vars,
    method = method,
    quiet = TRUE
)
drawObj
#> 
#> Draws Object
#> ------------
#> Number of Samples: 150
#> Number of Failed Samples: 0
#> Model Formula: CHANGE ~ 1 + THERAPY + VISIT + BASVAL * VISIT + THERAPY * VISIT
#> Imputation Type: random
#> Method:
#>     name: Bayes
#>     covariance: us
#>     same_cov: TRUE
#>     n_samples: 150
#>     prior_cov: default
#> Controls:
#>     warmup: 200
#>     thin: 5
#>     chains: 1
#>     init: mmrm
#>     seed: 1821

Note the use of set_vars() which specifies the names of the key variables within the dataset and the imputation model. Additionally, note that whilst vars$group and vars$visit are added as terms to the imputation model by default, their interaction is not, thus the inclusion of group * visit in the list of covariates.

Available imputation methods include:

Bayesian multiple imputation - method_bayes()
Approximate Bayesian multiple imputation - method_approxbayes()
Conditional mean imputation (bootstrap) - method_condmean(type = "bootstrap")
Conditional mean imputation (jackknife) - method_condmean(type = "jackknife")
Bootstrapped multiple imputation - method = method_bmlmi()

For a comparison of these methods, we refer to the stat_specs vignette (Section 3.10).

“statistical specifications” vignette (Section 3.10): vignette("stat_specs",package="rbmi").

Available imputation strategies include:

Missing At Random - "MAR"
Jump to Reference - "JR"
Copy Reference - "CR"
Copy Increments from Reference - "CIR"
Last Mean Carried Forward - "LMCF"

4 Impute

The next step is to use the parameters from the imputation model to generate the imputed datasets. This is done via the impute() function. The function only has two key inputs: the imputation model output from draws() and the reference groups relevant to reference-based imputation methods. It’s usage is thus:

imputeObj <- impute(
    drawObj,
    references = c("DRUG" = "PLACEBO", "PLACEBO" = "PLACEBO")
)
imputeObj
#> 
#> Imputation Object
#> -----------------
#> Number of Imputed Datasets: 150
#> Fraction of Missing Data (Original Dataset):
#>     4:   0%
#>     5:   8%
#>     6:  13%
#>     7:  25%
#> References:
#>     DRUG    -> PLACEBO
#>     PLACEBO -> PLACEBO

In this instance, we are specifying that the PLACEBO group should be the reference group for itself as well as for the DRUG group (as is standard for imputation using reference-based methods).

Generally speaking, there is no need to see or directly interact with the imputed datasets. However, if you do wish to inspect them, they can be extracted from the imputation object using the extract_imputed_dfs() helper function, i.e.:

imputed_dfs <- extract_imputed_dfs(imputeObj)
head(imputed_dfs[[10]], 12) # first 12 rows of 10th imputed dataset
#>     PATIENT HAMATOTL PGIIMP RELDAYS VISIT THERAPY GENDER POOLINV BASVAL
#> 1  new_pt_1       21      2       7     4    DRUG      F     006     32
#> 2  new_pt_1       19      2      14     5    DRUG      F     006     32
#> 3  new_pt_1       21      3      28     6    DRUG      F     006     32
#> 4  new_pt_1       17      4      42     7    DRUG      F     006     32
#> 5  new_pt_2       18      3       7     4 PLACEBO      F     006     14
#> 6  new_pt_2       18      2      15     5 PLACEBO      F     006     14
#> 7  new_pt_2       14      3      29     6 PLACEBO      F     006     14
#> 8  new_pt_2        8      2      42     7 PLACEBO      F     006     14
#> 9  new_pt_3       18      3       7     4    DRUG      F     006     21
#> 10 new_pt_3       17      3      14     5    DRUG      F     006     21
#> 11 new_pt_3       12      3      28     6    DRUG      F     006     21
#> 12 new_pt_3        9      3      44     7    DRUG      F     006     21
#>    HAMDTL17 CHANGE
#> 1        21    -11
#> 2        20    -12
#> 3        19    -13
#> 4        17    -15
#> 5        11     -3
#> 6        14      0
#> 7         9     -5
#> 8         5     -9
#> 9        20     -1
#> 10       18     -3
#> 11       16     -5
#> 12       13     -8

Note that in the case of method_bayes() or method_approxbayes(), all imputed datasets correspond to random imputations on the original dataset. For method_condmean(), the first imputed dataset will always correspond to the completed original dataset containing all subjects. For method_condmean(type="jackknife"), the remaining datasets correspond to conditional mean imputations on leave-one-subject-out datasets, whereas for method_condmean(type="bootstrap"), each subsequent dataset corresponds to a conditional mean imputation on a bootstrapped datasets. For method_bmlmi(), all the imputed datasets correspond to sets of random imputations on bootstrapped datasets.

5 Analyse

The next step is to run the analysis model on each imputed dataset. This is done by defining an analysis function and then calling analyse() to apply this function to each imputed dataset. For this vignette we use the ancova() function provided by the rbmi package which fits a separate ANCOVA model for the outcomes from each visit and returns a treatment effect estimate and corresponding least square means for each group per visit.

anaObj <- analyse(
    imputeObj,
    ancova,
    vars = set_vars(
        subjid = "PATIENT",
        outcome = "CHANGE",
        visit = "VISIT",
        group = "THERAPY",
        covariates = c("BASVAL")
    )
)
anaObj
#> 
#> Analysis Object
#> ---------------
#> Number of Results: 150
#> Analysis Function: ancova
#> Delta Applied: FALSE
#> Analysis Estimates:
#>     trt_4
#>     lsm_ref_4
#>     lsm_alt_4
#>     trt_5
#>     lsm_ref_5
#>     lsm_alt_5
#>     trt_6
#>     lsm_ref_6
#>     lsm_alt_6
#>     trt_7
#>     lsm_ref_7
#>     lsm_alt_7

Note that, similar to draws(), the ancova() function uses the set_vars() function which determines the names of the key variables within the data and the covariates (in addition to the treatment group) for which the analysis model will be adjusted.

Please also note that the names of the analysis estimates contain ref and alt to refer to the two treatment arms. In particular ref refers to the first factor level of vars$group which does not necessarily coincide with the control arm. In this example, since levels(dat[[vars$group]]) = c("DRUG", PLACEBO), the results associated with ref correspond to the intervention arm, while those associated with alt correspond to the control arm.

Additionally, we can use the delta argument of analyse() to perform a delta adjustments of the imputed datasets prior to the analysis. In brief, this is implemented by specifying a data.frame that contains the amount of adjustment to be added to each longitudinal outcome for each subject and visit, i.e. the data.frame must contain the columns subjid, visit, and delta.

It is appreciated that carrying out this procedure is potentially tedious, therefore the delta_template() helper function has been provided to simplify it. In particular, delta_template() returns a shell data.frame where the delta-adjustment is set to 0 for all patients. Additionally delta_template() adds several meta-variables onto the shell data.frame which can be used for manual derivation or manipulation of the delta-adjustment.

For example lets say we want to add a delta-value of 5 to all imputed values (i.e. those values which were missing in the original dataset) in the drug arm. That could then be implemented as follows:

# For reference show the additional meta variables provided
delta_template(imputeObj) %>% as_tibble()
#> # A tibble: 688 × 8
#>    PATIENT VISIT THERAPY is_mar is_missing is_post_ice strategy delta
#>    <fct>   <fct> <fct>   <lgl>  <lgl>      <lgl>       <chr>    <dbl>
#>  1 1503    4     DRUG    TRUE   FALSE      FALSE       <NA>         0
#>  2 1503    5     DRUG    TRUE   FALSE      FALSE       <NA>         0
#>  3 1503    6     DRUG    TRUE   FALSE      FALSE       <NA>         0
#>  4 1503    7     DRUG    TRUE   FALSE      FALSE       <NA>         0
#>  5 1507    4     PLACEBO TRUE   FALSE      FALSE       <NA>         0
#>  6 1507    5     PLACEBO TRUE   FALSE      FALSE       <NA>         0
#>  7 1507    6     PLACEBO TRUE   FALSE      FALSE       <NA>         0
#>  8 1507    7     PLACEBO TRUE   FALSE      FALSE       <NA>         0
#>  9 1509    4     DRUG    TRUE   FALSE      FALSE       <NA>         0
#> 10 1509    5     DRUG    TRUE   FALSE      FALSE       <NA>         0
#> # ℹ 678 more rows

delta_df <- delta_template(imputeObj) %>%
    as_tibble() %>% 
    mutate(delta = if_else(THERAPY == "DRUG" & is_missing , 5, 0)) %>% 
    select(PATIENT, VISIT, delta)
    
delta_df
#> # A tibble: 688 × 3
#>    PATIENT VISIT delta
#>    <fct>   <fct> <dbl>
#>  1 1503    4         0
#>  2 1503    5         0
#>  3 1503    6         0
#>  4 1503    7         0
#>  5 1507    4         0
#>  6 1507    5         0
#>  7 1507    6         0
#>  8 1507    7         0
#>  9 1509    4         0
#> 10 1509    5         0
#> # ℹ 678 more rows

anaObj_delta <- analyse(
    imputeObj,
    ancova,
    delta = delta_df,
    vars = set_vars(
        subjid = "PATIENT",
        outcome = "CHANGE",
        visit = "VISIT",
        group = "THERAPY",
        covariates = c("BASVAL")
    )
)

6 Pool

Finally, the pool() function can be used to summarise the analysis results across multiple imputed datasets to provide an overall statistic with a standard error, confidence intervals and a p-value for the hypothesis test of the null hypothesis that the effect is equal to 0.

Note that the pooling method is automatically derived based on the method that was specified in the original call to draws():

For method_bayes() or method_approxbayes() pooling and inference are based on Rubin’s rules.
For method_condmean(type = "bootstrap") inference is either based on a normal approximation using the bootstrap standard error (pool(..., type = "normal")) or on the bootstrap percentiles (pool(..., type = "percentile")).
For method_condmean(type = "jackknife") inference is based on a normal approximation using the jackknife estimate of the standard error.
For method = method_bmlmi() inference is according to the methods described by von Hippel and Bartlett (see the stat_specs vignette for details)

Since we have used Bayesian multiple imputation in this vignette, the pool() function will automatically use Rubin’s rules.

poolObj <- pool(
    anaObj, 
    conf.level = 0.95, 
    alternative = "two.sided"
)
poolObj
#> 
#> Pool Object
#> -----------
#> Number of Results Combined: 150
#> Method: rubin
#> Confidence Level: 0.95
#> Alternative: two.sided
#> 
#> Results:
#> 
#>   ==================================================
#>    parameter   est     se     lci     uci     pval  
#>   --------------------------------------------------
#>      trt_4    -0.092  0.683  -1.439  1.256   0.893  
#>    lsm_ref_4  -1.616  0.486  -2.576  -0.656  0.001  
#>    lsm_alt_4  -1.708  0.475  -2.645  -0.77   <0.001 
#>      trt_5     1.33   0.924  -0.494  3.155   0.152  
#>    lsm_ref_5  -4.15   0.659  -5.452  -2.849  <0.001 
#>    lsm_alt_5  -2.82   0.646  -4.097  -1.543  <0.001 
#>      trt_6    1.953   0.998  -0.018  3.924   0.052  
#>    lsm_ref_6  -6.09   0.715  -7.503  -4.677  <0.001 
#>    lsm_alt_6  -4.137  0.701  -5.521  -2.752  <0.001 
#>      trt_7    2.166   1.118  -0.045  4.376   0.055  
#>    lsm_ref_7  -6.974  0.821  -8.598  -5.351  <0.001 
#>    lsm_alt_7  -4.809  0.781  -6.351  -3.266  <0.001 
#>   --------------------------------------------------

The table of values shown in the print message for poolObj can also be extracted using the as.data.frame() function:

as.data.frame(poolObj)
#>    parameter         est        se         lci        uci         pval
#> 1      trt_4 -0.09180645 0.6826279 -1.43949684  1.2558839 8.931772e-01
#> 2  lsm_ref_4 -1.61581996 0.4862316 -2.57577141 -0.6558685 1.093708e-03
#> 3  lsm_alt_4 -1.70762640 0.4749573 -2.64531931 -0.7699335 4.262148e-04
#> 4      trt_5  1.33036878 0.9239311 -0.49439426  3.1551318 1.518620e-01
#> 5  lsm_ref_5 -4.15040536 0.6590542 -5.45206801 -2.8487427 2.847887e-09
#> 6  lsm_alt_5 -2.82003658 0.6464215 -4.09683850 -1.5432347 2.319168e-05
#> 7      trt_6  1.95324204 0.9976499 -0.01761942  3.9241035 5.205766e-02
#> 8  lsm_ref_6 -6.08995154 0.7153681 -7.50333618 -4.6765669 1.545921e-14
#> 9  lsm_alt_6 -4.13670950 0.7007487 -5.52128259 -2.7521364 2.275505e-08
#> 10     trt_7  2.16554484 1.1184001 -0.04484704  4.3759367 5.476906e-02
#> 11 lsm_ref_7 -6.97430508 0.8210463 -8.59804208 -5.3505681 3.164128e-14
#> 12 lsm_alt_7 -4.80876024 0.7805370 -6.35149800 -3.2660225 6.806194e-09

These outputs gives an estimated difference of 2.166 (95% CI -0.045 to 4.376) between the two groups at the last visit with an associated p-value of 0.055.

7 Code

We report below all the code presented in this vignette.

library(rbmi)
library(dplyr)

data("antidepressant_data")
dat <- antidepressant_data

# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
    dat,
    PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
    VISIT = levels(dat$VISIT),
    vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
    group = c("PATIENT"),
    order = c("PATIENT", "VISIT")
)

# Create data_ice and set the imputation strategy to JR for
# each patient with at least one missing observation
dat_ice <- dat %>% 
    arrange(PATIENT, VISIT) %>% 
    filter(is.na(CHANGE)) %>% 
    group_by(PATIENT) %>% 
    slice(1) %>%
    ungroup() %>% 
    select(PATIENT, VISIT) %>% 
    mutate(strategy = "JR")

# In this dataset, subject 3618 has an intermittent missing values which does not correspond
# to a study drug discontinuation. We therefore remove this subject from `dat_ice`. 
# (In the later imputation step, it will automatically be imputed under the default MAR assumption.)
dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618),]

# Define the names of key variables in our dataset using `set_vars()`
# and the covariates included in the imputation model
# Note that the covariates argument can also include interaction terms
vars <- set_vars(
    outcome = "CHANGE",
    visit = "VISIT",
    subjid = "PATIENT",
    group = "THERAPY",
    covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
)

# Define which imputation method to use (here: Bayesian multiple imputation with 150 imputed datsets)
method <- method_bayes(
    n_samples = 150,
    control = control_bayes(
        warmup = 200,
        thin = 5,
        seed = 1821 # Seed to be used by Stan
    )
)

# Create samples for the imputation parameters by running the draws() function
set.seed(987)
drawObj <- draws(
    data = dat,
    data_ice = dat_ice,
    vars = vars,
    method = method,
    quiet = TRUE
)

# Impute the data
imputeObj <- impute(
    drawObj,
    references = c("DRUG" = "PLACEBO", "PLACEBO" = "PLACEBO")
)

# Fit the analysis model on each imputed dataset
anaObj <- analyse(
    imputeObj,
    ancova,
    vars = set_vars(
        subjid = "PATIENT",
        outcome = "CHANGE",
        visit = "VISIT",
        group = "THERAPY",
        covariates = c("BASVAL")
    )
)

# Apply a delta adjustment

# Add a delta-value of 5 to all imputed values (i.e. those values
# which were missing in the original dataset) in the drug arm.
delta_df <- delta_template(imputeObj) %>%
    as_tibble() %>% 
    mutate(delta = if_else(THERAPY == "DRUG" & is_missing , 5, 0)) %>% 
    select(PATIENT, VISIT, delta)

# Repeat the analyses with the adjusted values
anaObj_delta <- analyse(
    imputeObj,
    ancova,
    delta = delta_df,
    vars = set_vars(
        subjid = "PATIENT",
        outcome = "CHANGE",
        visit = "VISIT",
        group = "THERAPY",
        covariates = c("BASVAL")
    )
)

# Pool the results
poolObj <- pool(
    anaObj,
    conf.level = 0.95,
    alternative = "two.sided"
)

rbmi: Quickstart

Craig Gower-Page, Alessandro Noci, and Marcel Wolbers