--- title: "Quickstart: applying PCP to a simulated environmental mixture" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quickstart: applying PCP to a simulated environmental mixture} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In this vignette, we will see how to leverage the functions in `pcpr` to: 1. Simulate and explore a simple environmental mixture dataset; 2. Determine the best PCP algorithm to employ with the simulated data; 3. Tune PCP's parameters using a cross-validated grid search; and 4. Evaluate PCP's performance against the simulated ground truth mixture model. We start by loading and attaching the `pcpr` package to our current R session. ```{r setup} library(pcpr) ``` Let's also write a quick helper function that will enable us to visualize the matrices we will be working with throughout this vignette as heatmaps: ```{r helper} library(magrittr) # for the pipe %>% library(ggplot2) # for plotting plot_matrix <- function(D, ..., lp = "none", title = NULL) { D <- t(D) if (is.null(colnames(D))) colnames(D) <- paste0("C", 1:ncol(D)) data.frame(D) %>% dplyr::mutate(x = paste0("R", 1:nrow(.))) %>% tidyr::pivot_longer(tidyselect::all_of(colnames(D)), names_to = "y", values_to = "value") %>% dplyr::mutate(x = factor(x, levels = unique(x)), y = factor(y, levels = unique(y))) %>% ggplot(aes(x = x, y = y, fill = value)) + geom_raster() + scale_y_discrete(limits = rev) + coord_equal() + scale_fill_viridis_c(na.value = "white", ...) + theme_minimal() + theme( axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), legend.position = lp, plot.margin = margin(0, 0, 0, 0), aspect.ratio = 1 ) + ggtitle(title) } ``` ## Simulating data The `sim_data()` function lets users generate simple mixtures models for quick experimentation. Let's use its default parameters to simulate a noisy environmental mixture $D = L_0 + S_0 + Z_0$ comprised of $n = 100$ observations of $p = 10$ chemicals, with three underlying chemical exposure patterns (or a rank $r = 3$), extreme outlying exposure events along the diagonal of the matrix, and dense Gaussian noise corrupting all the measurements in the matrix: ```{r sim data} data <- sim_data() D <- data$D L_0 <- data$L S_0 <- data$S Z_0 <- data$Z ``` Let's get a sense of what these matrices look like. Remember in real world analyses, we only observe $D$. The matrices $L_0$, $S_0$, and $Z_0$ are the ground truth, coming from some unobserved data generating mechanism: ```{r mat viz} plot_matrix(D) plot_matrix(L_0) plot_matrix(S_0) plot_matrix(Z_0) ``` The `matrix_rank()` function estimates the rank of a given matrix by counting the number of nonzero singular values governing that matrix (with the help of a `thresh` parameter determining what is "practically zero"): ```{r matrix rank} matrix_rank(L_0) matrix_rank(D) ``` Here we can see `L_0` has `r matrix_rank(L_0)` underlying patterns. This is obscured in the observed, full-rank mixture matrix `D`, since the `S_0` and `Z_0` components are corrupting the underlying structure provided by `L_0`. Let's simulate the chemicals (columns) of our data being subject to some limit of detection (LOD) with the `sim_lod()` function. We will simulate the bottom 10th percentile of each column as being below LOD, and examine the LOD for each column: ```{r lod} lod_info <- sim_lod(D, q = 0.1) D_lod <- lod_info$D_tilde lod <- lod_info$lod lod ``` Next, because our mixtures data is often incomplete in practice, let's further simulate a random 5% of the values as missing `NA` with the `sim_na()` function. Once our missing values are in place, we can finish simulating the LOD in the mixture by imputing values simulated < LOD to be the $LOD / \sqrt{2}$, a common imputation scheme for measurements < LOD: ```{r corrupt mat randomly} corrupted_data <- sim_na(D_lod, perc = 0.05) D_tilde <- corrupted_data$D_tilde lod_root2 <- matrix( lod / sqrt(2), nrow = nrow(D_tilde), ncol = ncol(D_tilde), byrow = TRUE ) lod_idxs <- which(lod_info$tilde_mask == 1) D_tilde[lod_idxs] <- lod_root2[lod_idxs] plot_matrix(D_tilde) ``` The `D_tilde` matrix represents our observed, messy mixtures model, suffering from incomplete `NA` observations and a chemical-specific LOD. ## Model selection There are two PCP algorithms shipped with `pcpr`: the convex `root_pcp()` [[Zhang et al. (2021)](https://proceedings.neurips.cc/paper/2021/hash/f65854da4622c1f1ad4ffeb361d7703c-Abstract.html)] and non-convex `rrmc()` [[Cherapanamjeri et al. (2017)](https://proceedings.mlr.press/v70/cherapanamjeri17a.html)]. To figure out which model would be best for our data, let's inspect the singular values of our observed mixture using the `sing()` method (`sing()` cannot accept missing values, so we will use `impute_matrix()` to impute `NA` values in `D_tilde` with their respective column means): ```{r sing} D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE)) singular_values <- sing(D_imputed) plot(singular_values, type = "b") ``` `rrmc()` is best suited for data characterized by slowly decaying singular values, indicative of complex underlying patterns and a relatively large degree of noise. Most EH data can be described this way. `root_pcp()` is best for data characterized by rapidly decaying singular values, indicative of very well-defined latent patterns. The singular values plotted above decay quickly from the first to the second, but very gradually from the second onward. For this simple simulated dataset, both PCP models are perfectly suitable. We will use `rrmc()`, as this is the model environmental health researchers will likely employ most frequently. The `vignette("pcp-applied")` contains an exemplary mixtures matrix singular value plot with slowly decaying singular values. ## Grid search for parameter tuning To estimate the low-rank and sparse matrices, `rrmc()` needs to be given a maximum rank `r` and regularization parameter `eta`. To determine the optimal values of `r` and `eta`, we will conduct a brief grid search using the `grid_search_cv()` function. In our grid search below, we will examine all models from ranks 1 through 5 and all values of `eta` near the default, calculated with `get_pcp_defaults()`. ```{r gs, eval = FALSE, echo = TRUE} eta_0 <- get_pcp_defaults(D_tilde)$eta etas <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2)))) # to get progress bar, could wrap this # in a call to progressr::with_progress({ gs <- grid_search_cv(...) }) gs <- grid_search_cv( D_tilde, pcp_fn = rrmc, grid = etas, r = 5, LOD = lod, parallel_strategy = "multisession", num_workers = 16, verbose = FALSE ) ``` The `rrmc()` approach to PCP uses an iterative rank-based procedure to recover `L` and `S`, meaning it first constructs a rank `1` model and iteratively builds up to the specified rank `r` solution. As such, for the above grid search, we passed `etas` as the `grid` argument to search and sent $r = 5$ as a constant parameter common to all models in the search. Since `length(etas) = 6` and $r = 5$, we searched through 30 different PCP models. The `num_runs` argument determines how many (random) tests should be performed for each unique model setting. By default, `num_runs = 100`, so our grid search tuned `r` and `eta` by measuring the performance of 3000 different PCP models. We passed the simulated `lod` vector as another constant to the grid search, equipping each `rrmc()` run with the same LOD information. ```{r real gs, eval = TRUE, echo = FALSE} gs <- readRDS("rds_files/quickstart-gs.rds") ``` ```{r gs results} r_star <- gs$summary_stats$r[1] eta_star <- round(gs$summary_stats$eta[1], 3) gs$summary_stats ``` Inspecting the `summary_stats` table from the output grid search provides the mean-aggregated statistics for each of the 30 distinct parameter settings we tested. The grid search correctly identified the rank `r r_star` solution as the best (lowest relative error `rel_err` rate). The corresponding `eta` = `r eta_star`. The top three parameter settings also seem to have reasonable `S_sparsity` levels as well (all are above `0.95`). The next three parameter settings seem to under-regularize the sparse `S` matrix by quite a bit, as 80% of entries are non-zero. We will take the top parameters identified by the grid search in this instance. Had the very top parameters yielded a sparsity of e.g. `0.7`, we likely then would have preferred the second set of parameters with sparisities in the `0.9`s. This decision would have been grounded in prior assumptions about the amount of outliers to expect in the mixtuere. For more on the interpreation of grid search results, consult the documentation for the `grid_search_cv()` function. ## Running PCP Now we can run our PCP model: ```{r rrmc} pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod) ``` We can inspect the evolution of the objective function over the course of PCP's optimization: ```{r obj} plot(pcp_model$objective, type = "l") ``` And the output `L` matrix: ```{r output L} plot_matrix(pcp_model$L) matrix_rank(pcp_model$L) ``` Let's briefly inspect PCP's estimate of the sparse matrix `S`, and fix any values that are "practically" zero using the `hard_threshold()` function. The histogram below shows a majority of the entries in `S` are between -0.4 and 0.4, so we will call those values "practically" zero, and the rest true outlying exposure events. We can then calculate the sparsity of `S` with `sparsity()`: ```{r sparse} hist(pcp_model$S) pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4) plot_matrix(pcp_model$S) sparsity(pcp_model$S) ``` ## Benchmarking with PCA Before evaluating our PCP model, let's see how well a more traditional method such as Principal Component Analysis (PCA) can recover `L_0`, to provide a benchmark for comparison. The `proj_rank_r()` function (project matrix to rank `r`) approximates an input matrix as low-rank using a rank-`r` truncated SVD, the same way PCA approximates a low-rank matrix. Normally, a researcher would need to determine `r` subjectively. We will give PCA an advantage by sharing PCP's discovery from the above grid search that the solution should be of rank `r r_star`: ```{r pca} L_pca <- proj_rank_r(D_imputed, r = r_star) ``` ## Evaluating PCP against the ground truth Finally, let's see how we did in recovering `L_0` and `S_0`. We will examine the relative error between our model's estimates and the simulated ground truth matrices. We use the Frobenius norm to calculate the relative errors between the matrices: ```{r performance metrics} data.frame( "Obs_rel_err" = norm(L_0 - D_imputed, "F") / norm(L_0, "F"), "PCA_L_rel_err" = norm(L_0 - L_pca, "F") / norm(L_0, "F"), "PCP_L_rel_err" = norm(L_0 - pcp_model$L, "F") / norm(L_0, "F"), "PCP_S_rel_err" = norm(S_0 - pcp_model$S, "F") / norm(S_0, "F"), "PCP_L_rank" = matrix_rank(pcp_model$L), "PCP_S_sparsity" = sparsity(pcp_model$S) ) ``` PCP outperformed PCA by quite a bit! PCP's relative recovery error on the `L_0` matrix stood at only `r round(norm(L_0 - pcp_model$L, "F") / norm(L_0, "F") * 100, 2)`%, compared to an observed relative error of `r round(norm(L_0 - D_imputed, "F") / norm(L_0, "F") * 100, 2)`% and PCA's relative error of `r round(norm(L_0 - L_pca, "F") / norm(L_0, "F") * 100, 2)`%. PCP's sparse matrix estimate was only off from the ground truth `S_0` by `r round(norm(S_0 - pcp_model$S, "F") / norm(S_0, "F") * 100, 2)`%. ## After PCP We can now pair our estimated `L` matrix with any matrix factorization method of our choice (e.g. PCA, factor analysis, or non-negative matrix factorization) to extract the latent chemical exposure patterns (an example of what this looks like is in `vignette("pcp-applied")`, where non-negative matrix factorization is used to extract patterns from PCP's `L` matrix). These patterns, along with the isolated outlying exposure events in `S`, can then be analyzed with any outcomes of interest in downstream epidemiological analyses.