LUCIDus: Latent Unknown Clustering with Integrated Data

Yinqi Zhao

2022-04-05

About the LUCIDus Package

The LUCIDus package aims to provide researchers in the genetic epidemiology community with an integrative R tool for conducting integrated clustering using exposures, omics data and phenotypic traits. It implements the statistical method proposed in the paper “A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits” [1], published in Bioinformatics.

Introduction to LUCID (Latent Unknown Clustering with Integrated Data)

Multi-omics data and the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, as illustrated by the directed acyclic graph (DAG) below (figure reproduced from the LUCID paper).

Let \(\mathbf{G}\) be a \(n \times p\) matrix whose columns represent genetic features/environmental exposures and whose rows are observations; let \(\mathbf{Z}\) be a \(n \times m\) matrix of standardized biomarkers and \(\mathbf{Y}\) be a \(n\)-dimensional vector of the disease outcome. Based on the DAG, we further assume that the three components above are linked by a categorical latent cluster variable \(\mathbf{X}\) with \(K\) classes. With the conditional independence implied by the DAG, the joint log-likelihood of the LUCID model can be written as \[\begin{equation} \begin{aligned} l(\mathbf{\Theta}) & = \sum_{i = 1}^n\log f(\mathbf{Z}_i, Y_i|\mathbf{G}_i; \mathbf{\Theta}) \\ & = \sum_{i = 1}^n \log \sum_{j = 1}^K f(\mathbf{Z}_i|X_i = j; \mathbf{\Theta}_j) f(Y_i|X_i = j; \mathbf{\Theta}_j) f(X_i = j|\mathbf{G}_i; \mathbf{\Theta}_j) \end{aligned} \end{equation}\] where \(\mathbf{\Theta}\) is a generic notation for the parameters of each probability model. Additionally, we assume \(\mathbf{X}\) follows a multinomial distribution conditional on \(\mathbf{G}\), \(\mathbf{Z}\) follows a multivariate normal distribution conditional on \(\mathbf{X}\), and \(\mathbf{Y}\) follows a normal/Bernoulli distribution (depending on the data structure of the disease outcome) conditional on \(\mathbf{X}\). The equation above therefore becomes \[\begin{equation} \begin{aligned} l(\mathbf{\Theta}) = \sum_{i = 1}^n \log \sum_{j = 1}^K S(\mathbf{G}_i; \boldsymbol{\beta}_j) \phi(\mathbf{Z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)f(Y_i;\mathbf{\Theta}_j) \end{aligned} \end{equation}\] where \(S\) denotes the softmax function and \(\phi\) denotes the probability density function (pdf) of the multivariate normal distribution.

To obtain the maximum likelihood estimates (MLE) of the model parameters, an EM algorithm is applied to handle the latent variable \(\mathbf{X}\). Denoting the observed data by \(\mathbf{D}\), the posterior probability that observation \(i\) is assigned to latent cluster \(j\) is \[\begin{equation} \begin{aligned} r_{ij} & = P(X_i = j|\mathbf{D}, \mathbf{\Theta}) \\ & = \frac{S(\mathbf{G}_i; \boldsymbol{\beta}_j) \phi(\mathbf{Z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)f(Y_i;\mathbf{\Theta}_j)}{\sum_{j = 1}^K S(\mathbf{G}_i; \boldsymbol{\beta}_j) \phi(\mathbf{Z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)f(Y_i;\mathbf{\Theta}_j)} \end{aligned} \end{equation}\] and the expected complete-data log-likelihood (up to terms not involving \(\mathbf{\Theta}\)) can be written as \[\begin{equation} \begin{aligned} Q(\mathbf{\Theta}) = \sum_{i = 1}^n\sum_{j = 1}^K r_{ij}\log\frac{S(\mathbf{G}_i; \boldsymbol{\beta}_j)}{r_{ij}} + \sum_{i = 1}^n\sum_{j = 1}^K r_{ij}\log\frac{\phi(\mathbf{Z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{r_{ij}} + \sum_{i = 1}^n\sum_{j = 1}^K r_{ij}\log\frac{f(Y_i; \boldsymbol{\Theta}_j)}{r_{ij}} \end{aligned} \end{equation}\] At each iteration, the E-step computes \(r_{ij}\) and the M-step updates the parameters by maximizing \(Q(\mathbf{\Theta})\). Details of the EM algorithm for LUCID can be found in Peng (2019).
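
As a purely illustrative sketch of the E-step (not the package's internal code), the responsibilities \(r_{ij}\) for one observation can be computed by multiplying the three component densities and normalizing across clusters. All parameter values below are arbitrary examples, and the mvtnorm package is assumed to be available for the multivariate normal density.

# Minimal E-step sketch for a single observation and K = 2 clusters
# (toy parameter values for illustration; est.lucid() computes r_ij internally)
library(mvtnorm)  # for the multivariate normal density

softmax <- function(x) exp(x) / sum(exp(x))

G_i <- c(0.5, -1.2, 0.3)   # exposures for observation i (p = 3)
Z_i <- c(0.1, 0.4)         # omics measurements (m = 2)
Y_i <- 1.5                 # continuous outcome

beta  <- rbind(c(0, 0, 0, 0), c(0.2, 1.0, -0.5, 0.8))  # softmax coefficients (intercept + p slopes)
mu    <- list(c(0, 0), c(1, 1))                         # cluster-specific means of Z
Sigma <- list(diag(2), diag(2))                         # cluster-specific covariances of Z
gamma <- c(0, 2)                                        # cluster-specific means of Y
sd_y  <- 1

prior <- softmax(as.vector(beta %*% c(1, G_i)))         # S(G_i; beta_j)
dens  <- sapply(1:2, function(j) {
  prior[j] *
    dmvnorm(Z_i, mean = mu[[j]], sigma = Sigma[[j]]) *  # phi(Z_i; mu_j, Sigma_j)
    dnorm(Y_i, mean = gamma[j], sd = sd_y)              # f(Y_i; Theta_j)
})
r_i <- dens / sum(dens)                                 # posterior responsibilities r_ij
r_i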

Main Functions

est.lucid(): Fits the LUCID model to estimate latent clusters using multi-omics data, with or without the outcome of interest. It features variable selection for exposures and omics data, covariance models with various geometric features for the omics data, and integrated imputation of missing values in the omics data.
lucid(): A wrapper function that fits the LUCID model across a grid of tuning parameters and selects the optimal model.
boot.lucid(): Conducts inference on the parameters of the LUCID model via bootstrap resampling.
summary_lucid(): Summarizes the results of a LUCID model produced by est.lucid(), presenting them in a table with detailed explanations of the parameters.
plot_lucid(): Visualizes the LUCID model with a Sankey diagram. Users can specify the color of each component in the diagram.

Fit LUCID model

We use the est.lucid() function to fit the LUCID model.

library(LUCIDus)
# use simulated data
G <- sim_data$G
Z <- sim_data$Z
Y_normal <- sim_data$Y_normal
Y_binary <- sim_data$Y_binary
cov <- sim_data$Covariate

# fit LUCID model with continuous outcome
fit1 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008)


# fit LUCID model with binary outcome
fit2 <- est.lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2, seed = 1008)

# fit LUCID model with covariates
fit3 <- est.lucid(G = G, Z = Z, Y = Y_binary, CoY = cov, family = "binary", K = 2, seed = 1008)

Users should be aware of the option useY. By default, useY = TRUE, which means the LUCID model is fitted in a supervised fashion, incorporating information from the outcome to define the clusters. Setting useY = FALSE fits the LUCID model in an unsupervised fashion, using only information from the exposures and omics data to define the clusters.

# fit LUCID model without using information from the outcome
fit4 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, useY = FALSE, seed = 1008)

For the LUCID model, we rely on the mclust method to optimize the likelihood of the omics data with respect to the mean and covariance structure. By default, a VVV model is used for the covariance structure of the omics data, meaning the volume, shape and orientation of the covariance matrices vary across clusters. Users can set modelName = NULL to let est.lucid select the optimal covariance model, or specify a particular covariance model for LUCID. Details of the available models can be found in mclust::mclustModelNames under the multivariate mixture section; see also the lookup example after the code below.

# fit LUCID model with automatic selection on optimal covariance models
fit5 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, modelName = NULL, seed = 1008)
# check the optimal model
fit5$modelName

# fit LUCID model with a specified covariance model
fit6 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, modelName = "EII", seed = 1008)
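
The meaning of a covariance model code can be looked up directly from mclust, for example:

# look up the definitions of covariance model codes in mclust
mclust::mclustModelNames("VVV")
mclust::mclustModelNames("EII")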

For the initialization of the EM algorithm, est.lucid uses either random initialization based on the uniform distribution or a heuristic method based on mclust. By default, the EM algorithm is initialized with mclust. However, if this takes a long time, we recommend switching to random initialization by setting init_par = "random".

# initialize EM algorithm by mclust
fit7 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, init_par = "mclust" , seed = 1008)

# initialize EM algorithm via randomization
fit8 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, init_par = "random" , seed = 1008)

To ensure reproducible results, users can set a random seed via the option seed. The fitted LUCID model can be summarized with summary_lucid().

# summarize lucid results
summary_lucid(fit1)

Visualization of LUCID model

We use a Sankey diagram to visualize the LUCID model. In the Sankey diagram, each node represents a variable in the LUCID model (exposure, omics data or outcome) and each link corresponds to an association between two variables. The color of a link indicates the direction of the association (by default, dark blue for positive and light blue for negative) and its width indicates the magnitude of the association (a larger effect corresponds to a wider link).

# visualize lucid model via a Sankey diagram
plot_lucid(fit1)
# change node color
plot_lucid(fit1, G_color = "yellow")
plot_lucid(fit1, Z_color = "red")

# change link color
plot_lucid(fit1, pos_link_color = "red", neg_link_color = "green")

LUCID incorporating missing omics data

The latest version of LUCID allows missing values in the omics data. We consider two missing-data patterns in omics data: (1) list-wise missingness, where only a subset of observations have measured omics data, and (2) sporadic missingness, where values are missing completely at random. We implement a likelihood partition for (1) and an integrated imputation-based EM algorithm for (2).

# fit LUCID model with list-wise missing pattern in omics data
Z_miss_1 <- Z
Z_miss_1[sample(1:nrow(Z), 0.3 * nrow(Z)), ] <- NA
fit9 <- est.lucid(G = G, Z = Z_miss_1, Y = Y_normal, family = "normal", K = 2)

# fit LUCID model with sporadic missing pattern in omics data
Z_miss_2 <- Z
index <- arrayInd(sample(length(Z_miss_2), 0.3 * length(Z_miss_2)), dim(Z_miss_2))
Z_miss_2[index] <- NA
fit10 <- est.lucid(G = G, Z = Z_miss_2, Y = Y_normal, family = "normal", K = 2, seed = 1008) 

For the sporadic missing pattern, est.lucid uses mclust::imputeData to initialize the imputation. The imputed omics data are also returned by est.lucid.

# imputed omics dataset
fit10$Z
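
As a quick sanity check (assuming, as stated above, that fit10$Z is the imputed version of the input matrix), we can verify that no missing values remain:

# sanity check on the imputed omics matrix
dim(fit10$Z)          # should match dim(Z_miss_2)
sum(is.na(fit10$Z))   # expected to be 0 after imputation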

Variable selection

In many situations, using all variables in the clustering analysis is not optimal. For example, some variables may not carry any information about the cluster structure. Including these non-informative variables only adds "noise" to the cluster assignment and unnecessarily increases the model complexity. For situations of moderate or low dimensionality, variable selection can facilitate model interpretation and help identify the correct cluster structure. The integrated variable selection for LUCID is built upon a penalized EM algorithm. We adapt the penalty function proposed by Zhou et al. (Electronic Journal of Statistics, 2009) \[\begin{equation} p_{\lambda}(\Theta) = \lambda_{\mu}\sum_{j=1}^K\sum_{l=1}^m|\mu_{jl}| + \lambda_{W}\sum_{j=1}^K \sum_{l,o}|w_{j;l,o}| \end{equation}\] where \(W = \Sigma^{-1}\).
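
To make the penalty concrete, the short sketch below (illustrative only, with arbitrary example parameters; est.lucid() applies the penalty internally during its EM algorithm) evaluates \(p_{\lambda}(\Theta)\) for K = 2 clusters and m = 3 omics features.

# sketch of the penalty p_lambda(Theta) for arbitrary example parameters
lambda_mu <- 0.5
lambda_W  <- 0.1
mu <- list(c(0.2, -1.0, 0.5), c(1.1, 0.0, -0.3))   # cluster-specific means (K = 2, m = 3)
W  <- list(diag(3), solve(0.5 * diag(3)))          # precision matrices W = Sigma^{-1}
penalty <- lambda_mu * sum(sapply(mu, function(m) sum(abs(m)))) +
  lambda_W  * sum(sapply(W, function(w) sum(abs(w))))
penalty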

# use LUCID model to conduct integrated variable selection
# select exposures
fit11 <- est.lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", 
                   K = 2, seed = 1008, Rho_G = 0.1)
# select omics data
fit12 <- est.lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal",
                   K = 2, seed = 1008, Rho_Z_Mu = 90, Rho_Z_Cov = 0.1, init_par = "random")

Model selection

For clustering analyses, we usually need to specify the number of clusters a priori. For LUCID, we use the Bayesian information criterion (BIC) to choose the optimal number of latent clusters. The strategy is to fit the LUCID model over a grid of \(K\) and choose the model with the lowest BIC. This can be done via the wrapper function lucid().

# tune lucid over a grid of K (note this function may take time to run)
tune_lucid <- lucid(G = G, Z = Z, Y = Y_normal, K =2:5)

Check the tuning process by

> tune_lucid$tune_list
  K Rho_G Rho_Z_Mu Rho_Z_Cov      BIC
1 2     0        0         0 46917.94
2 3     0        0         0 47716.94
3 4     0        0         0 48520.86
4 5     0        0         0 49276.93
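
Assuming tune_list is a data frame as printed above, the optimal number of clusters can also be extracted programmatically:

# pick the row of the tuning grid with the lowest BIC
tune_lucid$tune_list[which.min(tune_lucid$tune_list$BIC), ]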

The optimal model (K = 2) is returned as

tune_lucid$best_model

Bootstrap inference

We use a bootstrap method to conduct inference for LUCID.

# conduct bootstrap resampling
boot1 <- boot.lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100)

# use 90% CI
boot2 <- boot.lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100, conf = 0.9)

A diagnostic plot for the bootstrap samples can be drawn with

# check distribution for bootstrap replicates of the variable of interest
plot(boot1$bootstrap, 1)
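
The call to plot() above suggests that boot1$bootstrap behaves like a standard object from the boot package (an assumption on our part); if so, the raw bootstrap replicates can be inspected directly:

# inspect raw bootstrap replicates (assumes boot1$bootstrap is a 'boot' object)
boot1$bootstrap$t0       # point estimates from the original data
head(boot1$bootstrap$t)  # matrix of bootstrap replicates, one column per parameter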

Acknowledgments


  1. https://doi.org/10.1093/bioinformatics/btz667

  2. Supported by the National Cancer Institute at the National Institutes of Health, Grant P01 CA196569.