Clustering assigns data points to groups, i.e clusters, so that each group is internally coherent and also different from the others. In practice, however, clusters often turn out to be almost merging with one another. Sharpening in this context refers to increasing contrasts between cluster groups.
The concept of cluster sharpening goes back at least to the 1980s and the works of Tukey (Cleveland 1988). Methods that specifically mention cluster sharpening include dendrogram sharpening (Stanberry, Nandy, and Cordes 2003) and variations on the k-means algorithm (García-Escudero et al. 2008). Some R packages that produce clusterings and provide controls on the level of sharpening include
tclust (Fritz, Garcia-Escudero, and Mayo-Iscar 2012) and
trimcluster (Hennig 2018).
In contrast to methods that produce de-novo clusterings with specific sharpness properties, the
ksharp package aims to adjust existing clusterings. The package provides a general interface for this purpose along with several distinct algorithms. All are compatible with clusterings produced using the
stats package, the
cluster package (Maechler et al. 2019), the
dbscan package (Hahsler, Piekenbrock, and Doran 2019), as well as custom-made assignments.
This vignette first explains cluster sharpening in the methods section. It then demonstrates the algorithms implemented in the package on a range of examples. Throughout, most code is hidden for readability, but is available in the package source.
As a form of motivation and illustration, let’s look at the toy dataset below.
The raw data shows two circular blobs that are nearby on the plane (above, left). k-means clustering using two colors partitions the points into groups that appear symmetrical, except on the boundary between the groups (above, right). The distorted shapes suggest that the true groups may be in fact overlapping and, therefore, certain items assigned to one group may should belong to the other.
The ambiguous points can be marked by demanding increasing levels of sharpness.
With a mild level of sharpening (above, left), a few of the points on the boundary fade into a third group, displayed in gray. Increasing the sharpening threshold makes this group larger. At the cost of discarding some of the data, the leftover groups become more distinct and well-separated (above, right).
To achieve this effect using the
ksharp package, first we cluster the dataset
kdata.1 using k-means.
library(ksharp) km1 = kmeans(kdata.1, centers=2) table(km1$cluster)
## 1 2 ## 242 258
The algorithm partitions the data into groups of roughly equal size. Next, we sharpen the clustering using command
ks1 = ksharp(km1, threshold=0.05, data=kdata.1) table(ks1$cluster)
## 0 1 2 ## 25 231 244
The sharpening function creates an additional small group with cluster index
0 containing 5% of the data.
On the first run, the sharpening command requires access to the original data and can take a moment to compute various sharpening-related metrics. But once computed, it is straightforward to re-use those metrics to adjust the level of sharpening. We can adjust the size of the left-out group by varying the sharpness threshold.
ks1.10 = ksharp(ks1, threshold=0.1) table(ks1.10$cluster)
## 0 1 2 ## 50 222 228
ks1.20 = ksharp(ks1, threshold=0.2) table(ks1.20$cluster)
## 0 1 2 ## 100 202 198
By increasing the threshold, we delegate a larger proportion of the points to the group with index
0, reproducing the effect in the figure above.
Before moving to more intricate examples, let’s revisit the dataset with the two overlapping blobs using the three sharpening methods.
Each row represents output from a sharpening method. From left to right, increasing the threshold masks out more points. Although the results differ slightly across methods, the changes here appear to be minor.
Another dataset with two groups is
Here, points are arranged in non-circular shapes (above, left) and this can confuse the k-means algorithm (above, center). To avoid that, we can create the initial clustering using a density-based algorithm instead,
dbscan (above, right).
Sharpening of a
dbscan clustering proceeds in the same way as before. (A caveat is that
dbscan provides cluster assignments in unnamed lists.
ksharp requires a named list, so the initial clustering must be adjusted manually before sharpening.)
The silhouette and medoid methods (top and middle) tend to remove points from the left and right sides of the rectangles. The neighbor-based method (bottom) instead subtracts from the upper and lower sides.
Differences between sharpening methods can also be observed when data form groups with non-convex shapes. As an example, let’s use dataset
kdata.3 with a clustering by
The result from
dbscan contains three main groups. It also already assigns some points to a noise group, which has a cluster index of 0. But we can sharpen the result further.
An interesting feature that appears here is that the presence of a noise group in the original clustering also removes points in the real clusters that might overlap with the noise.
The last example consists of groups that are well defined, but are surrounded by noise,
Here, the initial clustering is generated by k-means with a manual group seeding. As k-means partitions all points to the specified number of clusters, the surrounding noise points are assigned to groups along the four dense regions near the center. Sharpening can in this case remove some of the noise points.
Interestingly, an already-sharp cluster assignment can be produced in this case by
dbscan, which can produce a noise group in its raw output. The above result thus reproduces a similar effect using a different algorithm.
Sharpening is a procedure that adjusts an existing clustering in order to better bring out the distinctions between cluster groups. Package
ksharp implements a specific form of cluster sharpening called ‘sharpening by excision’ (Cleveland, 1988). This works by removing some boundary points from a dataset to increase contrast between the remaining groups.
There are several possible algorithms for cluster sharpening. The
ksharp command delivers three different ones. The object output by function
ksharp contains details relevant to all of them. This makes it possible to adjust the sharpening level of already-sharpened clusterings in time O(N), where N is the number of data points. However, the initial creation of these details requires recalculation and sorting of the distance matrix and can thus be expensive on large datasets. Speeding up these calculations using alternative implementations is a possible avenue for development.
Cleveland, William S. 1988. The Collected Works of John W. Tukey: Graphics 1965-1985. Vol. 5. CRC Press.
Fritz, Heinrich, Luis A. Garcia-Escudero, and Agustin Mayo-Iscar. 2012. “tclust: An R Package for a Trimming Approach to Cluster Analysis.” Journal of Statistical Software 47 (12): 1–26. http://www.jstatsoft.org/v47/i12/.
García-Escudero, Luis A, Alfonso Gordaliza, Carlos Matrán, Agustin Mayo-Iscar, and others. 2008. “A General Trimming Approach to Robust Cluster Analysis.” The Annals of Statistics 36 (3). Institute of Mathematical Statistics: 1324–45.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 91 (1): 1–30. https://doi.org/10.18637/jss.v091.i01.
Hennig, Christian. 2018. Trimcluster: Cluster Analysis with Trimming. https://CRAN.R-project.org/package=trimcluster.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2019. Cluster: Cluster Analysis Basics and Extensions.
Stanberry, Larissa, Rajesh Nandy, and Dietmar Cordes. 2003. “Cluster Analysis of fMRI Data Using Dendrogram Sharpening.” Human Brain Mapping 20 (4). Wiley Online Library: 201–19.
ksharp is the main API function of the package. This function is designed to be compatible with clusterings produced by k-means and by functions from the
cluster package, e.g.
pam. It is also compatible with output from
dbscan after an adjustment of that clustering output to include data-point names.
The function can also sharpen self-made clusterings, as long as they are prepared as a list with a component
cluster consisting of a named vector associating each data point to a numeric cluster id.
ksharp function, the package also exports other functions tat compute auxiliary information such as silhouette widths. These implementations differ slightly from existing ones from package
cluster; see the documentation, e.g.
help(silinfo) for details.
## R version 3.6.1 (2019-07-05) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 18.04.3 LTS ## ## Matrix products: default ## BLAS: /software/opt/R/R-3.6.1/lib/libRblas.so ## LAPACK: /software/opt/R/R-3.6.1/lib/libRlapack.so ## ## locale: ##  LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C ##  LC_TIME=en_GB.UTF-8 LC_COLLATE=C ##  LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 ##  LC_PAPER=en_GB.UTF-8 LC_NAME=C ##  LC_ADDRESS=C LC_TELEPHONE=C ##  LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ##  stats graphics grDevices utils datasets methods base ## ## other attached packages: ##  ksharp_0.1.0.1 dbscan_1.1-5 cluster_2.1.0 Rcssplot_1.0.0 ## ## loaded via a namespace (and not attached): ##  Rcpp_1.0.3 digest_0.6.23 magrittr_1.5 evaluate_0.14 ##  rlang_0.4.2 stringi_1.4.3 rmarkdown_2.0 tools_3.6.1 ##  stringr_1.4.0 xfun_0.11 yaml_2.2.0 compiler_3.6.1 ##  htmltools_0.4.0 knitr_1.26