silhouette {cluster}  R Documentation 
Compute silhouette information according to a given clustering in k clusters.
silhouette(x, ...)

## Default S3 method:
silhouette(x, dist, dmatrix, ...)

## S3 method for class 'partition'
silhouette(x, ...)

## S3 method for class 'clara'
silhouette(x, full = FALSE, subset = NULL, ...)

sortSilhouette(object, ...)

## S3 method for class 'silhouette'
summary(object, FUN = mean, ...)

## S3 method for class 'silhouette'
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
x 
an object of appropriate class; for the 
dist 
a dissimilarity object inheriting from class

dmatrix 
a symmetric dissimilarity matrix (n x n),
specified instead of 
full 
logical or number in [0,1] specifying if a full
silhouette should be computed for 
subset 
a subset from 
object 
an object of class 
... 
further arguments passed to and from methods. 
FUN 
function used to summarize silhouette widths. 
nmax.lab 
integer indicating the number of labels which is considered too large for single-name labeling of the silhouette plot.
max.strlen 
positive integer giving the length to which strings are truncated in silhouette plot labeling. 
main, sub, xlab 
arguments to 
col, border, cex.names 
arguments passed

do.col.sort 
logical indicating if the colors 
do.n.k 
logical indicating if n and k “title text” should be written. 
do.clus.stat 
logical indicating if cluster sizes and averages should be written to the right of the silhouettes.
For each observation i, the silhouette width s(i) is
defined as follows:
Put a(i) = average dissimilarity between i and all other points of the
cluster to which i belongs (if i is the only observation in
its cluster, s(i) := 0 without further calculations).
For all other clusters C, put d(i,C) = average
dissimilarity of i to all observations of C. The smallest of these
d(i,C) is b(i) := min_C d(i,C),
and can be seen as the dissimilarity between i and its “neighbor”
cluster, i.e., the nearest one to which it does not belong.
Finally,
s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).
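The definition above can be traced by hand on a tiny example. The following is a minimal base-R sketch, not the package's implementation: the data x, the cluster codes cl and the helper sil_width() are all made up for illustration.

```r
# Hand-computed silhouette widths for a toy 1-D data set (illustrative only).
x  <- c(1, 2, 3, 10, 11, 25)   # toy observations (assumed)
cl <- c(1, 1, 1, 2, 2, 3)      # toy cluster codes (assumed); cluster 3 is a singleton
d  <- as.matrix(dist(x))       # full n x n dissimilarity matrix

# sil_width() is a hypothetical helper, not part of the cluster package.
sil_width <- function(i, d, cl) {
  own <- cl == cl[i]
  if (sum(own) == 1) return(0)                  # singleton cluster: s(i) := 0
  a <- mean(d[i, own & seq_along(cl) != i])     # a(i): average within-cluster dissimilarity
  others <- setdiff(unique(cl), cl[i])
  b <- min(vapply(others,                       # b(i): smallest d(i, C) over other clusters
                  function(C) mean(d[i, cl == C]), numeric(1)))
  (b - a) / max(a, b)
}

s <- vapply(seq_along(x), sil_width, numeric(1), d = d, cl = cl)
round(s, 3)
```

For the first observation, a(1) = 1.5 and b(1) = 9.5, so s(1) = 8 / 9.5; the singleton observation gets s = 0 by definition.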
silhouette.default() is now based on C code donated by Romain Francois (the R version being still available as cluster:::silhouette.default.R).
Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.
silhouette() returns an object, sil, of class silhouette which is an n x 3 matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width s(i) of the observation. The colnames correspondingly are c("cluster", "neighbor", "sil_width").
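As a small sketch of how the returned matrix might be inspected, assuming the cluster package is installed and using its ruspini data with an arbitrarily chosen k = 4:

```r
library(cluster)

data(ruspini)
pr4 <- pam(ruspini, 4)
sil <- silhouette(pr4$clustering, dist(ruspini))

colnames(sil)                  # "cluster", "neighbor", "sil_width"
dim(sil)                       # n rows, 3 columns
head(sil[, "sil_width"])       # individual silhouette widths s(i)

# Observations with negative width are candidates for being misassigned;
# their "neighbor" column names the cluster they sit closest to.
neg <- which(sil[, "sil_width"] < 0)
sil[neg, c("cluster", "neighbor"), drop = FALSE]
```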
summary(sil) returns an object of class summary.silhouette, a list with components

si.summary: numerical summary of the individual silhouette widths s(i).
clus.avg.widths: numeric (rank 1) array of clusterwise means of silhouette widths where mean = FUN is used.
avg.width: the total mean FUN(s) where s are the individual silhouette widths.
clus.sizes: table of the k cluster sizes.
call: if available, the call creating sil.
Ordered: logical identical to attr(sil, "Ordered"), see below.
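The components listed above can be read off the summary directly. A brief sketch, again assuming the ruspini data and k = 4 purely for illustration:

```r
library(cluster)

data(ruspini)
pr4  <- pam(ruspini, 4)
sil  <- silhouette(pr4$clustering, dist(ruspini))
ssil <- summary(sil)                 # FUN = mean by default

ssil$clus.sizes                      # table of the cluster sizes
ssil$clus.avg.widths                 # per-cluster mean silhouette width
ssil$avg.width                       # overall mean silhouette width

# A different summary function can be supplied through FUN:
smed <- summary(sil, FUN = median)
smed$avg.width                       # now the overall median width
```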
sortSilhouette(sil) orders the rows of sil as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width s(i). attr(sil, "Ordered") is a logical indicating if sil is ordered as by sortSilhouette(). In that case, rownames(sil) will contain case labels or numbers, and attr(sil, "iOrd") the ordering index vector.
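A short sketch of the ordering attributes, assuming a silhouette obtained from the default method on the ruspini data (k = 4 is an arbitrary choice):

```r
library(cluster)

data(ruspini)
pr4  <- pam(ruspini, 4)
sil  <- silhouette(pr4$clustering, dist(ruspini))
ssil <- sortSilhouette(sil)

attr(ssil, "Ordered")          # TRUE once sorted
iOrd <- attr(ssil, "iOrd")     # index vector mapping sorted rows back to sil
head(rownames(ssil))           # case labels in plotting order

# Rows are grouped by cluster; within a cluster, widths decrease.
head(ssil[, c("cluster", "sil_width")])
```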
While silhouette() is intrinsic to the partition clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes from hierarchical clusterings by calling silhouette.default() with the result of cutree() and a distance object as input.
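For instance, a tree from stats::hclust() can be cut and passed to the default method. This is only a sketch: the linkage method "average" and k = 4 are arbitrary choices made for illustration.

```r
library(cluster)

data(ruspini)
d  <- dist(ruspini)
hc <- hclust(d, method = "average")      # any hierarchical clustering would do

# cutree() yields integer cluster codes, which silhouette.default() accepts
# together with the dissimilarities used to build the tree.
sil_hc <- silhouette(cutree(hc, k = 4), d)
summary(sil_hc)$avg.width
```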
By default, for clara() partitions, the silhouette is just for the best random subset used. Use full = TRUE to compute (and later possibly plot) the full silhouette.
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.
Chapter 2 of Kaufman and Rousseeuw (1990), see the references in plot.agnes.
partition.object, plot.partition.
library(cluster)

data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with clusterwise coloring

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax= 80, cex.names=0.6)

op <- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
          mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()

## the same with clusterwise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
        col = c6[1:k])
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[ sample(nrow(xclara), size = 1000) ,]) # rownames == indices
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
     main ="plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)