--- title: "Correlation Subset Selection with corrselect" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Correlation Subset Selection with corrselect} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(corrselect) ``` ## Overview `corrselect` identifies all maximal subsets of variables whose pairwise correlations stay below a chosen threshold. This process reduces multicollinearity and redundancy before modeling, while preserving interpretability. Unlike greedy or stepwise approaches, `corrselect` exhaustively searches for all valid subsets using fast, exact algorithms. It is fully model-agnostic, making it suitable as a preprocessing step for regression, clustering, feature selection, and other analyses. Given a threshold \( t \in (0,1) \), the functions `corrSelect()` (data-frame interface) and `MatSelect()` (matrix interface) enumerate all **maximal** subsets \( S \) of variables satisfying: \[ \forall i, j \in S,\ i \neq j: \ |r_{ij}| < t \] where \(r_{ij}\) denotes the chosen correlation measure between variables \(i\) and \(j\). Enumeration relies on two exact graph-theoretic algorithms: 1. **Eppstein–Löffler–Strash (ELS)**, a degeneracy-ordered backtracking algorithm optimized for sparse graphs. 2. **Bron–Kerbosch (BK)**, a classical recursive clique-finding method, with optional pivoting to reduce search space. Results are returned as a `CorrCombo` S4 object containing each subset’s variable names and summary statistics (`avg_corr`, `min_corr`, `max_corr`). You can then extract subsets from the original data via `corrSubset()`. Because the procedure does not depend on any downstream model, it cleanly separates “feature curation” from “model fitting” and supports multiple correlation measures (`pearson`, `spearman`, `kendall`, `bicor`, `distance`, `maximal`). --- ## Quick Start (`CorrSelect`) ### Simulated numeric example ```{r} set.seed(42) n <- 100 df <- data.frame( A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n), E = rnorm(n) ) df$F <- df$A * 0.9 + rnorm(n, sd = 0.1) # strongly correlated with A ``` ### Basic selection ```{r} res <- corrSelect(df, threshold = 0.7) res as.data.frame(res) ``` ```{r} corrSubset(res, df, which = 1)[1:10,] ``` ### Forcing variables into all subsets ```{r} res2 <- corrSelect(df, threshold = 0.7, force_in = "A") res2 ``` ### Using a different correlation method ```{r} res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman") res3 ``` ## Matrix Interface (`MatSelect`) If you already computed a correlation matrix or want to apply the method to precomputed correlations: ```{r} mat <- cor(df) res4 <- MatSelect(mat, threshold = 0.7) res4 ``` Selecting subsets: ```{r} MatSelect(mat, threshold = 0.5) ``` Force variable 1 into every subset: ```{r} MatSelect(mat, threshold = 0.5, force_in = 1) ``` ## Mixed Data Types (`assocSelect`) ```{r} df_ass <- data.frame( height = rnorm(15, 170, 10), weight = rnorm(15, 70, 12), group = factor(rep(LETTERS[1:3], each = 5)), score = ordered(sample(c("low","med","high"), 15, TRUE)) ) # keep every subset whose internal associations ≤ 0.6 res5 <- assocSelect(df_ass, threshold = 0.6) res5 ``` --- ## Changing Correlation Method By default, `corrSelect()` uses Pearson correlation. You can choose alternatives with the `cor_method` argument: - `"pearson"`: linear correlation (default) - `"spearman"`: rank-based monotonic association - `"kendall"`: Kendall’s tau - `"bicor"`: robust biweight midcorrelation (`WGCNA::bicor`) - `"distance"`: distance correlation (`energy::dcor`) - `"maximal"`: maximal information coefficient (`minerva::mine`) Example: ```{r} res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman") res6 ``` --- ## Handling Mixed Data Types The function `assocSelect()` extends `corrSelect()` to support **mixed data types** — including numeric, factor, and ordered variables — by using appropriate association measures for each variable pair. Instead of a single correlation matrix, it constructs a **generalized association matrix** using the following logic: | Variable 1 | Variable 2 | Method Used | |------------------|------------------|----------------| | numeric | numeric | `pearson` (default; customizable) | | numeric | factor | `eta` | | numeric | ordered | `spearman` (default; customizable) | | factor | factor | `cramersv` | | factor | ordered | `cramersv` | | ordered | ordered | `spearman` (default; customizable) | The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments: ```{r} assocSelect(df_ass, method_num_num = "kendall", method_num_ord = "spearman", method_ord_ord = "kendall" ) ``` All other combinations use fixed methods (`eta` or `cramersv`) appropriate for measuring association strength. ### Example with Mixed Types ```{r} df_ass <- data.frame( height = rnorm(10), weight = rnorm(10), group = factor(sample(c("A", "B"), 10, replace = TRUE)), score = ordered(sample(1:3, 10, replace = TRUE)) ) res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE) res7 ``` Each pairwise association is bounded to [0,1] and treated analogously to correlation. --- ## Theory Given a symmetric correlation matrix \( R \in \mathbb{R}^{p \times p} \), we seek all **maximal subsets** \( S \subseteq \{1, \dots, p\} \) such that: \[ \forall i, j \in S,\ i \neq j: \ |R_{ij}| < t \] for a fixed threshold \( t \in (0, 1) \). This is equivalent to finding all **maximal cliques** in the **thresholded correlation graph**, where: - Nodes represent variables - Edges connect nodes whose absolute correlation is **below** the threshold A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit. --- ## Algorithms ### ELS (Eppstein–Löffler–Strash) The ELS algorithm efficiently enumerates all maximal cliques in a sparse graph using **degeneracy ordering**: 1. Compute a degeneracy ordering \( v_1, \dots, v_p \). 2. For each \( i \), extend current clique \( S \) from \( \{v_i\} \) within candidate set \( C = \{v_{i+1}, \dots, v_p\} \). 3. Recursively build cliques, pruning when no further vertices can be added. Formally, define: \[ \text{extend}(S, C) = \begin{cases} S, & C = \emptyset, \\ \bigcup_{v \in C} \text{extend}(S \cup \{v\},\ C \setminus (N(v) \cup \{v\})), & \text{otherwise}. \end{cases} \] ELS avoids redundant exploration, achieving good performance on typical correlation graphs. ### Bron–Kerbosch (with Pivoting) The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting: Let \( R \) = current clique, \( P \) = prospective nodes, \( X \) = excluded nodes. Then: \[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset, \\ \text{for each } v \in P \setminus N(u): \\ \quad \text{BK}(R \cup \{v\},\ P \cap N(v),\ X \cap N(v)), \ \quad P \leftarrow P \setminus \{v\},\ X \leftarrow X \cup \{v\}. \end{cases} \] Choosing a pivot \( u \in P \cup X \) and iterating over \( P \setminus N(u) \) reduces recursive calls. --- ## Why corrselect? Most existing R tools: - Filter one variable at a time (e.g. `findCorrelation`) - Use greedy or backward-selection heuristics - Do not enumerate **all** valid subsets **corrselect** uniquely provides: - **Exact** enumeration of maximal subsets - Support for multiple correlation measures - Optional forcing of variables - Full inspection via `CorrCombo` objects - Fast C++ implementations via Rcpp This makes it ideal for pipelines where **interpretability** and **completeness** are essential. --- ## Inspecting Results Convert results for downstream use: ```{r} df_res <- as.data.frame(res) head(df_res) ``` Extract individual subsets: ```{r} lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10)) ``` Summarize correlation metrics: ```{r} # Number and size of subsets length(res@subset_list) summary(lengths(res@subset_list)) # Summaries of within-subset correlations summary(res@max_corr) summary(res@avg_corr) ``` --- ## CorrCombo Object Structure A `CorrCombo` S4 object contains: - `subset_list`: list of character vectors (variable names) - `avg_corr`, `min_corr`, `max_corr`: numeric vectors of correlation metrics - `threshold`, `forced_in`, `search_type`, `cor_method`, `n_rows_used` - Attribute `use_pivot` (if applicable) Inspect slots: ```{r} str(res@subset_list) ``` --- ## Session Info ```{r} sessionInfo() ```