SIMICO Vignette

Jaihee Choi

2022-11-25

Introduction

(S)et-based (I)nference for (M)ultiple (I)nterval-(C)ensored (O)utcomes, or SIMICO, implements the multiple outcome test and multiple burden test specified in “Variance-components Tests for Genetic Association with Multiple Interval-censored Outcomes”.

Users can test a set of genetic variants with multiple interval-censored outcomes. Interval-censored outcomes arises when the exact time of onset of an event of interest is unknown but is known to fall between two time points. This type of data occurs in many large-scale genetic compendiums, such as the UK Biobank. Due to the large scale nature of these compendiums, one way that information is collected is via periodic questionnaires, causing much of the data to be interval-censored.

The steps to implement the functions, the specified form of the data inputs, and a worked example is shown below.

Data form

The data inputs for the SIMICO functions follows a specific form. All the left and right design matrices and interval times must be put in a list item indexed by the outcome number. Some code that achieves this data format is:

For the left and right design matrices (these are the matrices containing the covariate information with the cubic spline terms. The cubic spline terms can be generated by calling ICSKAT::make_IC_dmat())

dmats <- list(right_dmat = [right design matrix], left_dmat = [left design matrix])
allData <- list(xDats = list(dmats))
length(xDats)

should equal the number of outcomes

attributes(xAll$xDats[[1]]$dmats)}

should be “right_dmat” “left_dmat”.

Additionally, the data input list should also contain the left and right censoring information:
tpos_all and obs_all are both n x k matrices containing binary 0/1 for whether the individual is left or right censored for that outcome, respectively. If only the matrices of the left and right time (named leftTimesMat and rightTimesMat in the below example) are available, then one can convert these values to censoring terms by the following, for k outcomes:

obs_all <- matrix(NA, nrow= n, ncol = k)
tpos_all <- matrix(NA, nrow= n, ncol = k)

# Loop through all the outcomes
for(i in 1:k){
    obs_all[, i] <- ifelse(rightTimesMat[, i] == Inf, 0, 1)
    tpos_all[, i] <- ifelse(leftTimesMat[, i] == 0, 0, 1)
}

All the data should be together in one list with the following form:

allData <- list(xDats = xDats, ts_all = tpos_all, ob_all = obs_all)

Accessing the right design matrix for ONE outcome should be:

allData$xDats[[OUTCOME NUMBER]]$dmats$right_dmat

Accessing the n x 1 vector of whether the observations were left-censored or not should be:

allData$obs_all[,OUTCOME NUMBER]

Instructions for SIMICO Functions

There are three steps to running the multiple outcome test with this package:

Step 1:

Generate the left and right time intervals , as well as the left and right design matrices for the covariates. This can all be done through the simico_gen_dat() function, where the inputs are:
- bhFunInv is the inverse of the baseline hazard function.
- obsTimes is the vector of the intended observation times.
- windowHalf is the amount of time before or after the intended obsTimes that a visit might take place.
- n is the total number of observations.
- p is the total number of covariates.
- k is the total number of outcomes.
- tausq is the variance of the subject specific random effect.
- gMatCausal is the matrix of subsetted genetic information for only a select causal SNPs.
- xMat is the matrix of covariates.
- effectSizes is the vector of genetic effect sizes. Should be entered as a vector the same length as the number of outcomes.

Step 2:

Fit the null model. You only need to do this once for all SNP-sets to be tested. The call is simico_fit_null(), and you need the following arguments:
-init_beta is a vector holding the initial guess at the covariates for all of the outcomes plus the variance of the subject specific random effect. The number of elements should be equal to the number of columns in the design matrices times the number of outcomes plus 1 (for the subject specific random effect). Usually this can be initialized to a vector of 0s or 1s. If you happen to have a good idea of what the coefficients are, then this will speed up convergence.
- epsilon is the stopping criterion for NR.
- xDats is the list of left and right design matrices. The form is the same as specified earlier.
- lt_all is the n x k matrix of left times.
- rt_all is the n x k matrix of right times.
- k is the total number of outcomes.
- d is the total number of quadrature nodes.

Step 3

Call simico_out() for each set of SNPs that you want to test. The G argument should be the n x q matrix of genotypes. a1 and a2 are the shape parameters for the Beta distribution, which are used to weight certain variants over others by their MAF. You will also need xDats, lt_all, rt_all, k, and d, which have been defined above. Finally, you will need null_beta and Itt, which come directly from simico_fit_null(). A worked example is below.

Worked Example:

Suppose we are interested in testing whether a specific gene is associated with two different correlated outcomes: time until a fall and time until a fracture. We will simulate event times for 5,000 subjects under a proportional hazards model with baseline cumulative hazard function H(t)=t. We will set four observations times at times 1, 2, 3, and 4. Each subject will have a 10% chance of missing any given visit. The genetic data set will consist of 50 SNPs in the gene of interest. We will use 100 quadrature nodes. The subject specific random effect \(b_i\) will be drawn from a N(0,1) distribution. The genetic matrices were generated to have mean parameters uniformly distributed between (0.01, 0.05) and common pairwise correlation of 0.1. To simulate the data, we specifiy 5 causal variants with a MAF cutoff of 0.05. We simulated two fixed covariates–one generated from a N(0,2) and the other from a Binomaial(0.5). For the genetic effect sizes we chose (.03, .15).

Example with two outcomes

library(SIMICO)

# Set two outcomes
k = 2

# Set number of observations
n = 5000

# Set number of covariates
p = 2

# Set number of SNPs
q = 50

# Set number of quadrature nodes
d = 100

# Variance of subject-specific random effect
tauSq = 1

# Pairwise correlation
rho = 0.1

# Minor Allele Frequencey Cutoff
Causal.MAF.Cutoff = .05

# Total number of causal variants
num = 5

# Define the effect sizes
effectSizes <- c(.03, .15)

# the baseline cumulative hazard function
bhFunInv <- function(x) {x}

set.seed(1)
# Fixed effects
xMat <- cbind(rnorm(n, mean = 0, sd = 2), rbinom(n=n, size=1, prob=0.5))

# Genetic effects
gMat <- sim_gmat(n, q, rho)

# Get indices to specific select causal variants
idx <- Get_CausalSNPs_bynum(gMat, num, Causal.MAF.Cutoff)

# Subset the gMat
gMatCausal <- gMat[,idx]

# True model has nothing
fixedMat <- matrix(data=0, nrow=n, ncol=k)

# Generate the multiple outcomes
exampleDat <- simico_gen_dat(bhFunInv = bhFunInv, obsTimes = 1:3,
                             windowHalf = 0.1, n = n, p = p, k = k, 
                             tauSq = tauSq, gMatCausal = gMatCausal,
                             xMat = xMat, effectSizes = effectSizes)

# Set the initial estimate values
init_beta <-c (rep(c(0, 0, 0, 1, 0), k), 1)

# Run the newton-raphson
nullFit <- simico_fit_null(init_beta = init_beta, epsilon = 10^-5, 
                           xDats = exampleDat$fullDat$xDats, 
                           lt_all = exampleDat$leftTimesMat, rt_all = exampleDat$rightTimesMat, 
                           k = k, d = d)

# Get the test statistics p-values
out <- simico_out(nullFit = nullFit$beta_fit, xDats = exampleDat$fullDat$xDats, 
                  lt_all = exampleDat$leftTimesMat, rt_all = exampleDat$rightTimesMat, 
                  Itt = nullFit$jmat, a1 = 1, a2 = 25, 
                  G = gMat, k  = k, d = d)

#  results
# Score statistic
(out$multQ)
##         [,1]
## [1,] 3641276
# P-values
(out$multP)
## [1] 0.0004766614