c2c workflow

Mitchell Lyons

2017-07-23

What is c2c?

An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one has soft membership and the other hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. Helper functions also let you do things like turn a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.

How to use c2c

The basic premise is that you already have two (or perhaps more) classifications that you would like to compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of manually assigned classes, etc. A number of tools and packages already exist to calculate cluster diagnostics or accuracy metrics, but they usually focus on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the ‘truth’). c2c is designed to let you compare classifications that do not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (a hierarchy with varying numbers of classes), without losing the probabilistic component that the clustering algorithm produces.

An example with the iris data

In this vignette we will work through a simple, but hopefully useful, example using the iris data set. We will use a fuzzy clustering algorithm from the e1071 package.

library(c2c)
library(e1071)

Load the iris data set, and prep for clustering

data(iris)
iris_dat <- iris[,-5]

Let’s start with a cluster analysis with 3 groups, since we know that’s where we’re headed, and extract the soft classification matrix

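# note: cmeans() starts from random initial memberships, so the exact values
# (and cluster numbering) below may differ between runs; call set.seed() first
# if you need reproducible results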
fcm3 <- cmeans(x = iris_dat, centers = 3)
fcm3_probs <- fcm3$membership

Now we want to compare that soft matrix to a set of hard labels; we’ll use the species names. get_conf_mat produces the confusion matrix, and it takes two inputs - each can be a matrix or a set of labels

get_conf_mat(fcm3_probs, iris$Species)
##       setosa versicolor virginica
## 1  0.5697694   7.671859 35.837458
## 2  1.2051571  39.686802 13.099824
## 3 48.2250734   2.641340  1.062717
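
Under the hood this is just matrix algebra: the confusion matrix is the cross-product of the (sites x clusters) membership matrix and a (sites x species) binary membership matrix. A minimal sketch that should reproduce the matrix above, using the labels_to_matrix helper introduced below (as.matrix is just a guard in case a data frame is returned):

# cross-product of soft cluster memberships and binary species memberships
t(fcm3_probs) %*% as.matrix(labels_to_matrix(iris$Species))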

The output confusion matrix shows us the number of shared sites between our clustering solution and the set of labels (species in this case), accounting for the probabilistic memberships. We can see here that our 3 clusters have very clear fidelity to the species. We can also see what the relationship is like if we degrade the clustering to hard labels (this is the traditional error matrix/accuracy assessment case)

get_conf_mat(fcm3_probs, iris$Species, make.A.hard = TRUE)
##   setosa versicolor virginica
## 1      0          3        37
## 2      0         47        13
## 3     50          0         0

Nice, a little confusion between versicolor and virginica. Let’s try more clusters and see if we can tease it apart

fcm10 <- cmeans(x = iris_dat, centers = 10)
fcm10_probs <- fcm10$membership
get_conf_mat(fcm10_probs, iris$Species)
##         setosa versicolor  virginica
## 1   9.44825071  0.5148255  0.3076198
## 2   0.10803816  1.8013163 17.4345274
## 3   0.15266162  6.0867765 12.9719453
## 4   0.32861125 11.7003575  1.5248200
## 5   0.25354448 14.8999997  2.6927839
## 6   0.07529015  0.6758497  9.3986276
## 7  13.22512142  0.6026045  0.3000124
## 8   9.58070844  0.5162846  0.2705119
## 9   0.17866290 12.6373067  4.7953535
## 10 16.64911086  0.5646791  0.3037981
get_conf_mat(fcm10_probs, iris$Species, make.A.hard = TRUE)
##    setosa versicolor virginica
## 1       8          0         0
## 2       0          0        23
## 3       0          3        15
## 4       0         13         0
## 5       0         17         1
## 6       0          0        11
## 7      12          0         0
## 8      10          0         0
## 9       0         17         0
## 10     20          0         0

That cleans things up somewhat, but note that the uncertainty is hidden when you compare hard classifications. As an aside, when you set make.A.hard = TRUE, the function get_hard is used behind the scenes - it might be useful elsewhere too. Similarly, when you pass a vector of labels to get_conf_mat, the function labels_to_matrix makes the binary classification matrix.

head(get_hard(fcm3_probs))
##      1 2 3
## [1,] 0 0 1
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 0 0 1
## [6,] 0 0 1
head(labels_to_matrix(iris$Species))
##   setosa versicolor virginica
## 1      1          0         0
## 2      1          0         0
## 3      1          0         0
## 4      1          0         0
## 5      1          0         0
## 6      1          0         0
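
Putting those two helpers together, we can reproduce the earlier hard confusion matrix manually - a sketch of the equivalence described above:

# equivalent to get_conf_mat(fcm3_probs, iris$Species, make.A.hard = TRUE)
get_conf_mat(get_hard(fcm3_probs), iris$Species)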

You can also compare two soft matrices, for example we could compare the 3- and 10-class classifications we just made

get_conf_mat(fcm3_probs, fcm10_probs)
##           1          2          3         4          5         6
## 1 0.4684233 15.5557914  8.0787995  1.992887  2.6368163 7.8937377
## 2 0.8466211  3.3834162 10.4653741 10.137611 14.3853102 1.8791460
## 3 8.9556517  0.4046743  0.6672098  1.423291  0.8242016 0.3768837
##            7        8         9         10
## 1  0.3967318 0.386854  6.273494  0.3955523
## 2  0.7703569 0.741123 10.643809  0.7390148
## 3 12.9606497 9.239528  0.694020 16.3830209

Or we could directly compare two vectors of hard labels, which is a different way of doing what we already did above.

get_conf_mat(fcm3$cluster, iris$Species)
##   setosa versicolor virginica
## 1      0          3        37
## 2      0         47        13
## 3     50          0         0

Examining the confusion matrix can be enlightening by itself, but it can be useful to have some more quantitative metrics, particularly if you’re comparing lots of classifications - for example, you may be trying to optimise clustering parameters, or comparing lots of different clustering solutions. calculate_clustering_metrics does this

conf_mat <- get_conf_mat(fcm3_probs, iris$Species)
calculate_clustering_metrics(conf_mat)
## Percentage agreement WILL be calculated: it will only make sense if the confusion matrix diagonal corresponds to matching classes (i.e. rows and columns are in the same class order)
## $percentage_agreement
## [1] 0.2754619
## 
## $overall_purity
## [1] 0.8249956
## 
## $class_purity
## $class_purity$row_purity
##         1         2         3 
## 0.8130263 0.7350526 0.9286709 
## 
## $class_purity$col_purity
##     setosa versicolor  virginica 
##  0.9645015  0.7937360  0.7167492 
## 
## 
## $overall_entropy
## [1] 0.4504429
## 
## $class_entropy
## $class_entropy$row_entropy
##         1         2         3 
## 0.7629091 0.9445960 0.4325417 
## 
## $class_entropy$col_entropy
##     setosa versicolor  virginica 
##  0.2534011  0.9035868  0.9687373
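
To illustrate the ‘comparing lots of classifications’ use case, here is a minimal sketch (illustrative only, not part of c2c) that sweeps the number of clusters and collects overall purity for each solution:

# sweep candidate cluster numbers and extract overall purity for each;
# results will vary between runs, since cmeans initialisation is random
sapply(2:6, function(k) {
  probs <- cmeans(x = iris_dat, centers = k)$membership
  conf <- get_conf_mat(probs, iris$Species)
  calculate_clustering_metrics(conf)$overall_purity
})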

Purity and entropy are as defined in Manning et al. (2008). Overall and per-class metrics are included, as both have uses in different situations. See Lyons et al. (2017) and Foster et al. (2017) for use on a model-based vegetation clustering example. Finally, note the message above about percentage agreement - as it says, only use it if the clustering solutions have matching class orders (e.g. numeric cluster labels that stay in order), so that the confusion matrix diagonal corresponds to matching classes. For a decent classification, it shouldn’t differ much from purity anyway.
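
As a quick sanity check on those definitions: overall purity is just the sum of the per-cluster (row) maxima of the confusion matrix over its total, which should reproduce the overall_purity value above:

# sum of row maxima over the matrix total (~0.825 for conf_mat above)
sum(apply(conf_mat, 1, max)) / sum(conf_mat)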

References

Foster, Hill and Lyons (2017). Ecological grouping of survey sites when sampling artefacts are present. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211

Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.

Manning, Raghavan and Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.