% !Rnw weave = Sweave %\documentclass[article]{jss} \documentclass[nojss]{jss} % Package includes \usepackage[utf8]{inputenc} \usepackage[english]{babel} \usepackage{booktabs} %% \VignetteIndexEntry{OPTICS K-Xi Density-Based Clustering} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \author{Thomas Charlon\\University of Geneva} \title{\pkg{opticskxi}: OPTICS K-Xi Density-Based Clustering} \Plainauthor{Thomas Charlon} \Plaintitle{opticskxi: OPTICS K-Xi Density-Based Clustering} \Shorttitle{\pkg{opticskxi}: OPTICS K-Xi Density-Based Clustering} \Address{ Thomas Charlon\\ Stochastic Information Processing group\\ Department of Computer Science\\ University of Geneva\\ 1227 Carouge, Switzerland\\ E-mail: \email{charlon@protonmail.com} } \Abstract { Density-based clustering methods are well adapted to the clustering of high-dimensional data and enable the discovery of core groups of various shapes despite large amounts of noise. The \pkg{opticskxi} R package provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework to compare k-Xi models using distance-based metrics to investigate datasets with unknown number of clusters. This article first introduces density-based algorithms with simulated datasets, then presents and evaluates the k-Xi cluster extraction method. Finally, the models comparison framework is described and experimented on 2 genetic datasets to identify groups and their discriminating features. The k-Xi algorithm is a novel OPTICS cluster extraction method that specifies directly the number of clusters and does not require fine-tuning of the steepness parameter as the OPTICS Xi method. Combined with a framework that compares models with varying parameters, the OPTICS k-Xi method can identify groups in noisy datasets with unknown number of clusters. } \Keywords{DBSCAN, OPTICS, density-based clustering, hierarchical clustering} %% need no \usepackage{Sweave.sty} \begin{document} \newcommand{\figjss}[2]{ \begin{figure}[ht] \centering \includegraphics[width=\textwidth]{opticskxi-#1} \caption{\label{fig:#1}#2} \end{figure} } \section{Introduction} Density-based clustering methods detect groups of similar observations based on their distance to a given number of their nearest neighbors. In contrast with other clustering methods as k-means or Gaussian mixture models, they do not expect the observed data to follow Gaussian or other parametric distributions and they can thus detect groups of various shapes. In this article, density-based clustering algorithms are first presented on simulated datasets using the \pkg{dbscan} package~\citep{dbscan-R}, and limitations due to clusters of varying densities and fine-tuning of parameters are described. A novel cluster extraction algorithm, OPTICS k-Xi, is then presented and evaluated on the datasets. Finally, a framework to compare multiple k-Xi models with varying parameters is detailed and experimented on 2 genetic datasets of Schizophrenia and Crohn's disease patients, to enable further investigation of the best models and identify genetic signatures of core groups. \subsection{DBSCAN} DBSCAN~\citep{ester1996density} is a well-known density-based clustering algorithm with 3 parameters: a distance matrix, a number of neighbors, and a reachability distance threshold. The algorithm first searches for core points in the distance matrix, \textit{i.e.} points that have distances from at least a given number of points, the number of neighbors, smaller than the reachability threshold. If one is found, the core point and its neighbors form a cluster, and if additional core points are found within its neighbors, the cluster is expanded to also include their neighbors, iteratively until no core points are discovered in the neighbors. Additional clusters are then formed similarly for each core point not yet assigned and finally remaining non-assigned points are considered noise. In the \pkg{dbscan} package, by default the euclidean distance is used, with 5 neighbors, and the distance threshold must be fine-tuned. In these simulated datasets from the \pkg{factoextra} and \pkg{dbscan} R packages, points are organized by various shapes along with some noise, on 2 variables. With specific distance thresholds, DBSCAN successfully detects the shapes and the noise (Figure~\ref{fig:shape_ds3}). <>= options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE) @ \SweaveOpts{fig=TRUE, include=FALSE, height=5} <>= library('opticskxi') data('multishapes') dbscan_shapes <- dbscan::dbscan(multishapes[1:2], eps = 0.15) gg_shapes <- cbind(multishapes[1:2], Clusters = dbscan_shapes$cluster) %>% ggpairs(group = 'Clusters') data('DS3', package = 'dbscan') dbscan_ds3 <- dbscan::dbscan(DS3, minPts = 25, eps = 12) gg_ds3 <- cbind(DS3, Clusters = dbscan_ds3$cluster) %>% ggpairs(group = 'Clusters') cowplot::plot_grid(gg_shapes, gg_ds3, nrow = 2, labels = c('(a)', '(b)'), label_x = 0.9) @ \figjss{shape_ds3}{DBSCAN clusterings of various shapes and noise (cluster 0). (a): Multishapes dataset. (b): DS3 dataset.} However, DBSCAN uses a fixed distance threshold and thus can not detect clusters of varying densities. In this simulated Gaussian data with 2 large clusters and 2 smaller, more dense, clusters, DBSCAN detects either the pair of large or small clusters, depending on the distance threshold, but can not detect all 4 clusters (Figure~\ref{fig:multi_gauss}). <>= n <- 1e3 set.seed(0) multi_gauss <- cbind.data.frame( x = c(rnorm(n / 2, -3), rnorm(n / 4, 3), rnorm(n / 4, 3, .2)), y = c(rnorm(n * .75), rnorm(n / 8, 1, .2), rnorm(n / 8, -1, .2))) dbscan_gauss <- dbscan::dbscan(multi_gauss, minPts = 30, eps = .5) gg_mgauss <- cbind(multi_gauss, Clusters = dbscan_gauss$cluster) %>% ggpairs(group = 'Clusters') gg_mgauss_small <- dbscan::dbscan(multi_gauss, minPts = 30, eps = .2) %$% cbind(multi_gauss, Clusters = cluster) %>% ggpairs(group = 'Clusters') cowplot::plot_grid(gg_mgauss, gg_mgauss_small, nrow = 2, labels = c('(a)', '(b)'), label_x = .9) @ \figjss{multi_gauss}{DBSCAN clustering of hierarchical Gaussian clusters. (a): Detection of large clusters. (b): Detection of small clusters.} \subsection{OPTICS} OPTICS~\citep{ankerst1999optics} is another density-based algorithm that produces an ordering and a distance profile of observations, similar to a tree-like dendogram, and enables the detection of clusters of varying densities with the cluster extraction method OPTICS Xi~\citep{ankerst1999optics}. OPTICS uses at least 2 parameters, a distance matrix and a number of neighbors, and produces a distance profile that reveals the density structure of the dataset and can be used to extract clusters. The algorithm iteratively explores point neighborhoods in the order of lowest to highest core distance, \textit{i.e.} the maximum distance from a point to a given number of its nearest neighbors, and returns the orders and the reachability distances of successive points, \textit{i.e.} the maximum between the core distance of the point and the distance from it to the previous point. Low reachability-distances regions, or valleys, thus represent clusters and are separated by peaks, \textit{i.e.} points with high reachability distances. OPTICS can be used to fine-tune the distance threshold in DBSCAN, as the DBSCAN method is equivalent to a horizontal threshold on the reachability plot. \subsection{OPTICS Xi} To extract clusters of varying densities from OPTICS profiles, the OPTICS Xi~\citep{ankerst1999optics} algorithm uses a steepness threshold. The differences between reachability distances of successive points are first computed. Then, for each absolute distance difference above the given threshold, all adjacent points with a smaller reachability distance form a cluster. Two cases are distinguished when forming a cluster, steep down or steep up areas, in which the reachability distance differences that delimit the cluster are negative or positive, \textit{i.e.} when the first observation has a reachability higher or lower than the second observation, respectively. \begin{itemize} \item In a steep down area, both observations that produce the large distance difference are part of the cluster, and all successive points with a reachability distance smaller than the first observation, are part of the cluster. \item In a steep up area, only the first observation is part of the cluster, and all previous points with a reachability distance smaller than the second observation, and the adjacent previous point, are part of the cluster. \end{itemize} OPTICS Xi thus detects clusters of varying densities, possibly hierarchical, although the Xi steepness parameter must be fine-tuned. In the simulated hierarchical Gaussian data, OPTICS Xi successfully detects both the large and small clusters with Xi = 0.03 (Figure~\ref{fig:multi_gauss_xi}). \SweaveOpts{height=2} <>= optics_gauss <- dbscan::optics(multi_gauss, minPts = 30) xi_gauss <- dbscan::extractXi(optics_gauss, xi = 0.03) ggplot_optics(optics_gauss, groups = xi_gauss$cluster) @ \figjss{multi_gauss_xi}{OPTICS profile of the hierarchical Gaussian clusters, colored by OPTICS Xi clustering.} \section{OPTICS k-Xi} The \pkg{opticskxi} package provides a variant OPTICS cluster extraction algorithm, k-Xi, that specifies directly the number of clusters and does not require fine-tuning a parameter. Instead of using a fixed distance difference threshold OPTICS Xi, the k-Xi algorithm iteratively investigates the largest differences until at the given number of clusters are defined. \subsection{Algorithm} For each successive largest difference, OPTICS k-Xi will attempt to form a cluster of all adjacent points with a smaller reachability-distance, distinguishing steep down and up areas similarly as OPTICS Xi (detailed above). If the newly formed cluster contains less observations than the \code{pts} parameter, or if it reduces the size of a previously formed cluster below the \code{pts} parameter, the new cluster is discarded and the next largest difference is considered. The algorithm then stops when the number of clusters has reached the \code{n_xi} parameter, or when the number of largest differences considered has reached the \code{max_loop} threshold, by default 50. The \code{pts} parameter is set by default to the \code{minPts} parameter used to compute the OPTICS profile, and avoids introducing small clusters due to nearby large distance differences. \subsection{Results} In the hierarchical Gaussian data, OPTICS k-Xi successfully detects the large and small clusters (Figure~\ref{fig:multi_gauss_kxi}). <>= kxi_gauss <- opticskxi(optics_gauss, n_xi = 4, pts = 100) ggplot_optics(optics_gauss, groups = kxi_gauss) @ \figjss{multi_gauss_kxi}{OPTICS profile of the hierarchical Gaussian clusters, colored by OPTICS k-Xi clustering.} In the multishapes and the DS3 datasets, OPTICS k-Xi also successfully detects the shapes, but the noise is included in the largest cluster (Figure~\ref{fig:kxi_shapes_ds3}). <>= gg_shapes_optics <- dbscan::optics(multishapes[1:2]) %>% ggplot_optics(groups = opticskxi(., n_xi = 5, pts = 30)) gg_ds3_optics <- dbscan::optics(DS3, minPts = 25) %>% ggplot_optics(groups = opticskxi(., n_xi = 6, pts = 100)) cowplot::plot_grid(gg_shapes_optics, gg_ds3_optics, nrow = 2, labels = c('(a)', '(b)'), label_x = .9) @ \figjss{kxi_shapes_ds3}{OPTICS profiles colored by OPTICS k-Xi clusterings. (a): Multishapes dataset. (b): DS3 dataset.} \section{Models comparisons by distance-based metrics} To explore complex datasets where clusters are not well defined, k-Xi models with various distances, number of points, and number of clusters may be investigated and compared. Furthermore, in datasets with many variables, dimension reduction methods as principal component analysis (PCA) or independent component analysis (ICA) may be required prior to the clustering to summarize information. The \pkg{opticskxi} package provides a framework to efficiently compare multiple k-Xi models with varying parameters and to extract and visualize the models with highest metrics, to enable further investigation of the clusters of the best models. \subsection{Framework} The main function, \code{opticskxi_pipeline}, inputs a data frame of k-Xi clustering parameters and returns corresponding clustering results and their distance-based metrics. Parameters are specified using a data frame with the following columns: \begin{itemize} \item \code{dim_red}: Optional dimension reduction: 'PCA', 'ICA' (using the \pkg{fastICA} package~\citep{fastICA-R}) \item \code{n_dim_red}: Optional number of components of the dimension reduction \item \code{dist}: Distance, one of the 11 distances from the \pkg{amap} package~\citep{amap-R} \item \code{pts}: Number of points for OPTICS (\code{minPts}) and k-Xi \item \code{n_xi}: Number of clusters for k-Xi \end{itemize} To efficiently compute multiple k-Xi models with varying dimension reductions and parameters, the framework proceeds step-by-step, by first computing the unique dimension reduction matrices required, then unique distance matrices, unique OPTICS models, and finally k-Xi cluster extractions. Distance-based metrics are then measured for each model using the \pkg{fpc} package~\citep{fpc-R}. The following metrics are stored: \code{avg.silwidth}, \code{bw.ratio}, \code{ch}, \code{pearsongamma}, \code{dunn}, \code{dunn2}, \code{entropy}, \code{widestgap}, \code{sindex}. Finally, clusters and metrics are binded to the input parameter data frame that defines the unique parameters of each model, in the columns \code{clusters} and \code{metrics}. Three functions can be used directly on the results data frame to extract specific models and investigate the distance profiles and the clusterings: \begin{itemize} \item \code{get_best_kxi}: Subset the data frame by specifying a metric and one or more ranks, in decreasing order of the metric. \item \code{ggplot_kxi_metrics}: Plot metrics of the top ranked k-Xi models, by default the 8 models with highest average silhouette width. Additional metrics can be displayed next to the one used for ranking, by default the between-within ratio. \item \code{gtable_kxi_profiles}: Plot OPTICS profiles of the top ranked k-Xi models, by default the 4 models with highest average silhouette width. \end{itemize} \subsection{Schizophrenia patients and controls} In this dataset from the \pkg{gap} R package~\citep{gap-R}, 6 alleles from the chromosome 6 gene \textit{HLA} were measured from 94 schizophrenia patients and 177 controls. All combinations of the following OPTICS k-Xi parameters are computed: \begin{itemize} \item Distance: Manhattan, Euclidean, absolute Pearson, absolute correlation \item Number of clusters: 3 to 5 \item Number of points: 20, 30, 40 \end{itemize} The 8 best models by average silhouette width are first visualized and reveal that all use Manhattan or Euclidean distances and half use a \code{pts} parameter of 20 (Figure~\ref{fig:hla_metrics}). \SweaveOpts{height=3} <>= data('hla') m_hla <- hla[-c(1:2)] %>% scale df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30, 40), dist = c('manhattan', 'euclidean', 'abscorrelation', 'abspearson')) df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla) ggplot_kxi_metrics(df_kxi_hla, n = 8) @ \figjss{hla_metrics}{Distance-based metrics of the 8 best k-Xi clusterings of the HLA dataset, ranked by decreasing average silhouette width. All the models use Manhattan or Euclidean distances and half use a \code{pts} parameter of 20.} The OPTICS profiles of the 4 best models are then visualized and reveal that the two best models only differ by their number of clusters, 3 or 4, and that the third and fourth models have hierarchical clusters (Figure~\ref{fig:hla_profiles}). <>= gtable_kxi_profiles(df_kxi_hla) %>% plot @ \figjss{hla_profiles}{OPTICS profiles of the 4 best k-Xi clusterings of the HLA dataset, ranked by decreasing average silhouette width. The two best models only differ by their number of clusters, 3 or 4, and the third and fourth models have hierarchical clusters.} The second best model is then selected to investigate the model with 4 clusters. To assess if patients are significantly enriched in each group, standardized Pearson residuals are computed using \code{chisq.test}. One disease is enriched or depleted in one group if the residual is above or below 2, respectively~\citep{friendly1994mosaic}. Results show cluster 4 is enriched in Schizophrenia patients ($residual = 3.98$): 58\% of individuals are patients, although the distribution in the complete dataset is 34\%; and that cluster 2 is enriched in controls ($residual = 2.45$) (Table~\ref{tab:hla}). <>= best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2) clusters_hla <- best_kxi_hla$clusters hla$id %<>% `levels<-`(c('Controls', 'Sch. patients')) residuals_table(clusters_hla, hla$id) %>% print_table('HLA') @ The groups are finally visualized using PCA dimension reduction, and the contributions of variables are displayed to identify the contributions of the genetic markers. On PCA, the genetic markers most discriminating cluster 4 from other observations are \textit{DQB.a2}, \textit{DQR.a2}, and \textit{DQA.a2} (Figure~\ref{fig:hla_dimred}). <>= fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>% ggpairs('Clusters', ellipses = TRUE, variables = TRUE) @ \figjss{hla_dimred}{PCA of the HLA dataset colored by k-Xi clustering with 95\% confidence ellipses.} \subsection{Crohn's disease patients and relatives} In this other dataset from the \pkg{gap} package~\citep{gap-R}, 212 single nucleotide polymorphisms (SNPs) from chromosome 5 (5q31) were measured from 129 Crohn's disease patients and their 2 parents. Since the number of variables is high, dimension reduction methods are applied to the clustering. All combinations of the following OPTICS k-Xi parameters are computed: \begin{itemize} \item Dimension reduction: PCA, ICA \item Number of dimension reduction components: 4, 6, 8 \item Distance: Euclidean, absolute Pearson, absolute correlation \item Number of clusters: 3 to 5 \item Number of points: 30, 40, 50 \end{itemize} The 8 best models by average silhouette width mostly use the Absolute Pearson or Absolute Correlation distances and 3 clusters (Figure~\ref{fig:crohn_metrics}). <>= data('crohn') m_crohn <- crohn[-c(1:6)] %>% scale df_params_crohn <- expand.grid(n_xi = 3:5, dim_red = c('PCA', 'ICA'), dist = c('euclidean', 'abscorrelation', 'abspearson'), pts = c(30, 40, 50), n_dimred_comp = c(4, 6, 8)) df_kxi_crohn <- opticskxi_pipeline(m_crohn, df_params_crohn) ggplot_kxi_metrics(df_kxi_crohn) @ \figjss{crohn_metrics}{Distance-based metrics of the 8 best k-Xi clusterings of the Crohn dataset, ranked by decreasing average silhouette width.} The OPTICS profiles of the 4 best models reveal that all have hierarchical clusters (Figure~\ref{fig:crohn_profiles}). <>= gtable_kxi_profiles(df_kxi_crohn) %>% plot @ \figjss{crohn_profiles}{OPTICS profiles of the 4 best k-Xi clusterings of the Crohn dataset, ranked by decreasing average silhouette width.} In the best model, cluster 2 is enriched in Crohn's disease patients ($residual = 3.73$): 46\% of individuals are patients, although only 37\% in the complete dataset; and cluster 3 is enriched in controls ($residual = 3.05$) (Table~\ref{tab:crohn}). <>= best_kxi_crohn <- get_best_kxi(df_kxi_crohn, rank = 1) clusters_crohn <- best_kxi_crohn$clusters crohn$crohn %<>% factor %>% `levels<-`(c('Controls', 'Crohn patients')) residuals_table(clusters_crohn, crohn$crohn) %>% print_table('Crohn') @ The groups are then visualized using the corresponding dimension reduction, ICA with 4 components, which reveals the hierarchical structure of clusters 1 and 2, discriminated from cluster 3 mostly by the second and third components (Figure~\ref{fig:crohn_dimred}). <>= ica <- fortify_ica(m_crohn, n.comp = 4, sup_vars = data.frame(Clusters = clusters_crohn)) ggpairs(ica, 'Clusters', axes = 1:4, ellipses = TRUE, level = .75) %>% plot @ \figjss{crohn_dimred}{ICA with 4 axes of the Crohn dataset, colored by k-Xi clustering with 75\% confidence ellipses. Clusters 1 and 2 are hierarchical and are discriminated from cluster 3 mostly by the second and third components.} The dimension reduction visualization is then focused on the second and third components to reveal the variables with strong contributions (Figure~\ref{fig:crohn_dimred_23}). <>= ggpairs(ica, 'Clusters', axes = 2:3, ellipses = TRUE, variables = TRUE, n_vars = 3) @ \figjss{crohn_dimred_23}{ICA with 2 axes of the Crohn dataset, colored by k-Xi clustering with 95\% confidence ellipses.} \section{Conclusions} The OPTICS k-Xi algorithm attempts directly to define a given number of clusters and does not require fine-tuning of a steepness parameter as OPTICS Xi. Combined with a framework to compare models with varying parameters, the k-Xi method can identify core groups in noisy datasets with an unknown number of clusters. Recent density-based algorithms as HDBSCAN~\citep{campello2013density} also enable to detect clusters of varying densities and to specify directly the number of clusters to define. In contrast with OPTICS k-XI which iteratively attempts to define clusters until the specified number is reached, HDBSCAN can provide any given number of clusters, based on a hierarchical structure. Future work may thus include comparing OPTICS k-Xi with more recent density-based algorithms as HDBSCAN, and expanding the models comparison framework to include other density-based algorithms. \section{Acknowledgements} This work was inspired by Jérôme Wojcik (Precision for Medicine) and Sviatoslav Voloshynovskiy (University of Geneva). \bibliography{biblio} \end{document}