pam { cluster } | R Documentation |
Partitioning (clustering) of the data into < code >k clusters “around medoids”, a more robust version of K-means.
pam(x, k, diss = inherits(x, "dist"), metric = c("euclidean", "manhattan"), medoids = NULL, stand = FALSE, cluster.only = FALSE, do.swap = TRUE, keep.diss = !diss && !cluster.only && n < 100, keep.data = !diss && !cluster.only, pamonce = FALSE, trace.lev = 0)
x |
data matrix or data frame, or dissimilarity matrix or object, depending on the value of the < code >diss argument. In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (< code >NAs) < em >are allowed—as long as every pair of observations has at least one case not missing. In case of a dissimilarity matrix, < code >x is typically the output of < code >daisy or < code >dist. Also a vector of length n*(n-1)/2 is allowed (where n is the number of observations), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (< code >NAs) are < em >not allowed. |
k |
positive integer specifying the number of clusters, less than the number of observations. |
diss |
logical flag: if TRUE (default for < code >dist or < code >dissimilarity objects), then < code >x will be considered as a dissimilarity matrix. If FALSE, then < code >x will be considered as a matrix of observations by variables. |
metric |
character string specifying the metric to be used for calculating
dissimilarities between observations. |
medoids |
NULL (default) or length-< code >k vector of integer indices (in < code >1:n) specifying initial medoids instead of using the ‘< em >build’ algorithm. |
stand |
logical; if true, the measurements in < code >x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If < code >x is already a dissimilarity matrix, then this argument will be ignored. |
cluster.only |
logical; if true, only the clustering will be computed and returned, see details. |
do.swap |
logical indicating if the < b >swap phase should happen. The default, < code >TRUE, correspond to the original algorithm. On the other hand, the < b >swap phase is much more computer intensive than the < b >build one for large n, so can be skipped by < code >do.swap = FALSE. |
keep.diss, keep.data |
logicals indicating if the dissimilarities and/or input data < code >x should be kept in the result. Setting these to < code >FALSE can give much smaller results and hence even save memory allocation < em >time. |
pamonce |
logical or integer in < code >0:2 specifying algorithmic short cuts as proposed by Reynolds et al. (2006), see below. |
trace.lev |
integer specifying a trace level for printing diagnostics during the build and swap phase of the algorithm. Default < code >0 does not print anything; higher values print increasingly more. |
The basic < code >pam algorithm is fully described in chapter 2 of Kaufman and Rousseeuw(1990). Compared to the k-means approach in < code >kmeans, the function < code >pam has the following features: (a) it also accepts a dissimilarity matrix; (b) it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared euclidean distances; (c) it provides a novel graphical display, the silhouette plot (see < code >plot.partition) (d) it allows to select the number of clusters using < code >mean(silhouette(pr)[, "sil_width"]) on the result < code >pr <- pam(..), or directly its component < code >pr$silinfo$avg.width, see also < code >pam.object.
When < code >cluster.only is true, the result is simply a (possibly
named) integer vector specifying the clustering, i.e.,
< code >pam(x,k, cluster.only=TRUE) is the same as
< code >pam(x,k)$clustering but computed more efficiently.
The < code >pam-algorithm is based on the search for < code >k
representative objects or medoids among the observations of the
dataset. These observations should represent the structure of the
data. After finding a set of < code >k medoids, < code >k clusters are
constructed by assigning each observation to the nearest medoid. The
goal is to find < code >k representative objects which minimize the sum
of the dissimilarities of the observations to their closest
representative object.
By default, when < code >medoids are not specified, the algorithm first
looks for a good initial set of medoids (this is called the
< b >build phase). Then it finds a local minimum for the
objective function, that is, a solution such that there is no single
switch of an observation with a medoid that will decrease the
objective (this is called the < b >swap phase).
When the < code >medoids are specified, their order does < em >not matter; in general, the algorithms have been designed to not depend on the order of the observations.
The < code >pamonce option, new in cluster 1.14.2 (Jan. 2012), has been proposed by Matthias Studer, University of Geneva, based on the findings by Reynolds et al. (2006).
The default < code >FALSE (or integer < code >0) corresponds to the original “swap” algorithm, whereas < code >pamonce = 1 (or < code >TRUE), corresponds to the first proposal .... and < code >pamonce = 2 additionally implements the second proposal as well.
an object of class < code >"pam" representing the clustering. See < code >?pam.object for details.
For large datasets, < code >pam may need too much memory or too much computation time since both are O(n^2). Then, < code >clara() is preferable, see its documentation.
There is hard limit currently, n <= 65536, at 2^{16} because for larger n, n(n-1)/2 is larger than the maximal integer (< code >.Machine$integer.max = 2^{31} - 1).
Kaufman and Rousseeuw's orginal Fortran code was translated to C
and augmented in several ways, e.g. to allow < code >cluster.only=TRUE
or < code >do.swap=FALSE, by Martin Maechler.
Matthias Studer, Univ.Geneva provided the < code >pamonce
implementation.
Reynolds, A., Richards, G., de la Iglesia, B. and Rayward-Smith, V. (1992) Clustering rules: A comparison of partitioning and hierarchical clustering algorithms; < em >Journal of Mathematical Modelling and Algorithms < b >5, 475–504 ( http://dx.doi.org/10.1007/s10852-005-9022-1 ).
< code >agnes for background and references; < code >pam.object, < code >clara, < code >daisy, < code >partition.object, < code >plot.partition, < code >dist.
## generate 25 objects, divided into 2 clusters. x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)), cbind(rnorm(15,5,0.5), rnorm(15,5,0.5))) pamx <- pam(x, 2) pamx # Medoids: '7' and '25' ... summary(pamx) plot(pamx) ## use obs. 1 & 16 as starting medoids -- same result (typically) (p2m <- pam(x, 2, medoids = c(1,16))) ## no _build_ *and* no _swap_ phase: just cluster all obs. around (1, 16): p2.s <- pam(x, 2, medoids = c(1,16), do.swap = FALSE) p2.s p3m <- pam(x, 3, trace = 2) ## rather stupid initial medoids: (p3m. <- pam(x, 3, medoids = 3:1, trace = 1)) pam(daisy(x, metric = "manhattan"), 2, diss = TRUE) data(ruspini) ## Plot similar to Figure 4 in Stryuf et al (1996) ## Not run: plot(pam(ruspini, 4), ask = TRUE)