The FCPS package

Michael C. Thrun

2023-10-18

The package FCPS provides standardized access to state-of-the-art clustering algorithms, datasets defining common clustering challenges, and the estimation of the number of clusters. Additionally, the cluster tendency can be investigated, the number of clusters estimated, and an appropriate accuracy can be computed for arbitrary labels.

The installation guide can be found in the README.md file.

In the following example, the high-dimensional leukemia dataset is loaded and a visualized:

library(FCPS)
data("Leukemia")
Data=Leukemia$Distance
Cls=Leukemia$Cls
ClusterPlotMDS(Data,Cls,main = 'Leukemia',Plotter3D = 'plotly')

The function ClusterPlotMDS used for the figure above provides in case of datasets with dimensionality higher than three a 3D projection using multidimensionl scaling of the R package smacof on CRAN . The user can decide if the rgl package or the plotly package should be used .

Cluster Analysis with FCPS

In the following code a high-dimensional dataset is loaded. The leukemia dataset provides a distance matrix instead of a data matrix. Without further adjustments the function AgglomerativeNestingClustering can be called with the correct number of clusters. The resulting clustering is stored in the list element Cls which is a named vector. The names are defined by the rows of the distance or data matrix. In the next step, the user can rename the clustering to consecutive number from 1 to 6 with 1 beeing the label of the cluster with the biggest size and 6 beeing the cluster of the smallest size. The names will still match all rownames of the data or distances. Besides the Cls element the output list CA stores the original object of the clustering. In the case of hierarchical algorithms another list element stores the Dendrogram which can be visualized with ClusterDendrogram which is shown in the next section.

library(FCPS)
data('Leukemia')
set.seed(123)
ClusterNo=length(unique(Leukemia$Cls))
CA=AgglomerativeNestingClustering(Leukemia$DistanceMatrix,ClusterNo)
Cls=ClusterRenameDescendingSize(CA$Cls)
sum(match(names(Cls),rownames(Leukemia$DistanceMatrix),nomatch = 0)==0)
#> [1] 0

Generating a Clustering Challenge

Any clustering challenge listed in Table~2 can be generated with an arbitrary sample size. Here, Chainlink is selected and visualized in the figure below.

set.seed(600)
library(FCPS)
DataList=ClusterChallenge("Chainlink",SampleSize = 750)
Data=DataList$Chainlink
Cls=DataList$Cls
ClusterPlotMDS(Data,Cls,Plotter3D = 'plotly',main = "Chainlink")
ClusterCount(Cls)
#> $UniqueClusters
#> [1] 1 2
#> 
#> $CountPerCluster
#>   1   2 
#> 376 374 
#> 
#> $NumberOfClusters
#> [1] 2
#> 
#> $ClusterPercentages
#> [1] 50.13333 49.86667

Remark: ClusterPlotMDS detects that the dataset has only three dimensions and instead of projecting the data, visualizes the tree given dimensions.

Testing for Cluster Tendency

The cluster tendency or so-called clusterability can be investigated with the ggplot2 syntax as follows for the example of Chainlink:

set.seed(600)
library(FCPS)
DataList=ClusterChallenge("Chainlink",SampleSize = 750)
Data=DataList$Chainlink
Cls=DataList$Cls
library(ggplot2)
ClusterabilityMDplot(Data)+theme_bw()
#> [1] "This MD-plot is typically for several features at once. By calling as.matrix(), it will be now used with one feature."
#> NULL

The figure presents the result for the sample of Chainlink. The MD plot shows mulitmodality and the statistical testing agrees with the MD plot (p<0.01 that data has no cluster tendency). Therefore, the sample has a high cluster tendency.

Estimating the Number of Clusters

Lets assume that there is no prior knowledge about the Chainlink data available and the hierarchical algorithm single linkage is selected. Looking at the dendrogram the highest change in fusion level is two. However, maybe each of the main clusters has two subclusters presented in Figure 3. Therefore the function ClusterNoEstimation is used to investigate this assumption. Figure 4 presents the Fan plot of the amount of indicators preferring a specific number of clusters for the sample of the Chainlink dataset. Majority vote proposes the cluster number 7 or 3 with the correct cluster number equal to 2 as the second. The appropriate number of clusters would be two because neither 7 or 3 are present in the dendrogram. The following code uses a numerical data matrix for the hierarchical clustering algorithm. If not set otherwise, internally the Euclidean distances computed by the parallelDist are computed and used. Furthermore, the fastcluster is used to compute the tree. The dendextend allows to color the branches user-specifc if the ClusterDendrogram is used. The function ClusterNoEstimation expects an matrix of clusterings, each column one “Cls” ordered in the range of cluster numbers of interest. In this example, the range from 2 to 7 is investigated.

library(FCPS)
set.seed(135)
DataList=ClusterChallenge("Chainlink",SampleSize = 900)
Data=DataList$Chainlink
Cls=DataList$Cls
Tree=HierarchicalClustering(Data,1,"SingleL")[[3]]
ClusterDendrogram(Tree,4,main='Single Linkage')
MaximumNunber=7
clsm <- matrix(data = 0, nrow = dim(Data)[1],  ncol = MaximumNunber)
for (i in 2:(MaximumNunber+1)) {
  clsm[,i-1] <- cutree(Tree,i)
}
out=ClusterNoEstimation(Data,ClsMatrix = clsm, MaxClusterNo = MaximumNunber,PlotIt = TRUE)

Accurate Comparison to a Given Ground Truth

Usually, clustering accuracy can either be computed only correctly for binary classifications or is computed per cluster as shown below. The latter does not allow for a straightforward comparison between algorithms. Often a simple approach of computing the overall accuracy is provided in packages, for example, in see . The following code outlines why the overall accuracy is not correct if computed straightforward. The solution is provided by the function ClusterAccuracy which calculates the correct accuracy of a clustering algorithm:

library(FCPS)
data("Leukemia")
Distance=Leukemia$DistanceMatrix
Classification=Leukemia$Cls
Cls=HierarchicalClustering(Distance,6,"SingleL")$Cls

#Usual Computation Accuracy per Class
cm=as.matrix(table(Cls,Classification))
diag(cm)/rowSums(cm)
#>          1          2          3          4          5          6 
#> 0.06666667 0.00000000 1.00000000 1.00000000 1.00000000 1.00000000
# Usual overall Accuracy
sum(diag(cm)) / sum(cm) 
#> [1] 0.7797834
#e.g.
#MLmetrics::Accuracy(Cls,Classification)
#Correct Computation
ClusterAccuracy(Cls,Classification)
#> [1] 0.9963899
cm
#>    Classification
#> Cls   1   2   3   4   5   6
#>   1   1  14   0   0   0   0
#>   2   0   0 108   0   0   0
#>   3   0   0   1   0   0   0
#>   4   0   0   0 266   0   0
#>   5   0   0   0   0 163   0
#>   6   0   0   0   0   0   1

Further Functionality

This section outlines further functionalities focussing on the possibilities in cases for which the definitions in Table 3 are not met.
The user can transform factors to numerical vectors using the function ClusterCreateClassification and perform simple cluster-based evaluations per column of data or used the created Cls otherwise. In the example below, the mean per cluster and feature is computed with ClusterApply:

library(datasets)
library(FCPS)
Iris=datasets::iris
Data=as.matrix(Iris[,1:4])
SomeFactors=Iris$Species
V=ClusterCreateClassification(SomeFactors)
Cls=V$Cls
V$ClusterNames
#>            1            2            3 
#>     "setosa" "versicolor"  "virginica"
ClusterApply(Data,mean,Cls)
#> $UniqueClusters
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1        5.006       3.428        1.462       0.246
#> 2        5.936       2.770        4.260       1.326
#> 3        6.588       2.974        5.552       2.026
#> 
#> $meanPerCluster
#> [1] "1" "2" "3"

The same computations are possible for distance matrices. The function ClusterApply can also be used to transform distances to data. Exemplary the tetragonula dataset is loaded from the prablcus package . The dataset consists of a data frame with 236 cases and 13 string features. For a brief overview about the data please see . The computed distance matrix can be used in this package directly or via MDS transformation to a numerical matrix:

suppressPackageStartupMessages(library('prabclus',quietly = TRUE))
data(tetragonula)
#Generated Specific Distance Matrix
ta <- alleleconvert(strmatrix=as.matrix(tetragonula[1:236,]))
tai <- alleleinit(allelematrix=ta,distance="none")
Distances=alleledist((unbuild.charmatrix(tai$charmatrix,236,13)),236,13)
Cls=rep(1,nrow(Distances))
DataTrans=ClusterApply(Distances,identity,Cls)$identityPerCluster
dim(DataTrans)
#> NULL
dim(Distances)
#> [1] 236 236

Summary

Fifty-four conventional clustering algorithms are provided in the R}package FCPS on CRAN with consistent input and output. This enables the user to try out many algorithms swiftly. Additionally, 26 statistical approaches for the estimation of the number of clusters, as well as the mirrored density plot (MD-plot) of clusterability, are provided. Moreover, the fundamental clustering problems suite (FCPS) offers a variety of clustering challenges any algorithm should handle when facing real-world data.