pam {cluster} | R Documentation |

Partitioning (clustering) of the data into `k`

clusters “around
medoids”, a more robust version of K-means.

pam(x, k, diss = inherits(x, "dist"), metric = c("euclidean", "manhattan"), medoids = NULL, stand = FALSE, cluster.only = FALSE, do.swap = TRUE, keep.diss = !diss && !cluster.only && n < 100, keep.data = !diss && !cluster.only, pamonce = FALSE, trace.lev = 0)

`x` |
data matrix or data frame, or dissimilarity matrix or object,
depending on the value of the In case of a matrix or data frame, each row corresponds to an
observation, and each column corresponds to a variable. All
variables must be numeric. Missing values ( In case of a dissimilarity matrix, |

`k` |
positive integer specifying the number of clusters, less than the number of observations. |

`diss` |
logical flag: if TRUE (default for |

`metric` |
character string specifying the metric to be used for calculating
dissimilarities between observations. |

`medoids` |
NULL (default) or length- |

`stand` |
logical; if true, the measurements in |

`cluster.only` |
logical; if true, only the clustering will be computed and returned, see details. |

`do.swap` |
logical indicating if the |

`keep.diss, keep.data` |
logicals indicating if the dissimilarities
and/or input data |

`pamonce` |
logical or integer in |

`trace.lev` |
integer specifying a trace level for printing
diagnostics during the build and swap phase of the algorithm.
Default |

The basic `pam`

algorithm is fully described in chapter 2 of
Kaufman and Rousseeuw(1990). Compared to the k-means approach in `kmeans`

, the
function `pam`

has the following features: (a) it also accepts a
dissimilarity matrix; (b) it is more robust because it minimizes a sum
of dissimilarities instead of a sum of squared euclidean distances;
(c) it provides a novel graphical display, the silhouette plot (see
`plot.partition`

) (d) it allows to select the number of clusters
using `mean(silhouette(pr)[, "sil_width"])`

on the result
`pr <- pam(..)`

, or directly its component
`pr$silinfo$avg.width`

, see also `pam.object`

.

When `cluster.only`

is true, the result is simply a (possibly
named) integer vector specifying the clustering, i.e.,

`pam(x,k, cluster.only=TRUE)`

is the same as

`pam(x,k)$clustering`

but computed more efficiently.

The `pam`

-algorithm is based on the search for `k`

representative objects or medoids among the observations of the
dataset. These observations should represent the structure of the
data. After finding a set of `k`

medoids, `k`

clusters are
constructed by assigning each observation to the nearest medoid. The
goal is to find `k`

representative objects which minimize the sum
of the dissimilarities of the observations to their closest
representative object.

By default, when `medoids`

are not specified, the algorithm first
looks for a good initial set of medoids (this is called the
**build** phase). Then it finds a local minimum for the
objective function, that is, a solution such that there is no single
switch of an observation with a medoid that will decrease the
objective (this is called the **swap** phase).

When the `medoids`

are specified, their order does *not*
matter; in general, the algorithms have been designed to not depend on
the order of the observations.

The `pamonce`

option, new in cluster 1.14.2 (Jan. 2012), has been
proposed by Matthias Studer, University of Geneva, based on the
findings by Reynolds et al. (2006).

The default `FALSE`

(or integer `0`

) corresponds to the
original “swap” algorithm, whereas `pamonce = 1`

(or
`TRUE`

), corresponds to the first proposal ....
and `pamonce = 2`

additionally implements the second proposal as
well.

an object of class `"pam"`

representing the clustering. See
`?pam.object`

for details.

For large datasets, `pam`

may need too much memory or too much
computation time since both are *O(n^2)*. Then,
`clara()`

is preferable, see its documentation.

There is hard limit currently, *n <= 65536*, at
*2^{16}* because for larger *n*, *n(n-1)/2* is larger than
the maximal integer (`.Machine$integer.max`

= *2^{31} - 1*).

Kaufman and Rousseeuw's orginal Fortran code was translated to C
and augmented in several ways, e.g. to allow `cluster.only=TRUE`

or `do.swap=FALSE`

, by Martin Maechler.

Matthias Studer, Univ.Geneva provided the `pamonce`

implementation.

Reynolds, A., Richards, G., de la Iglesia, B. and Rayward-Smith, V. (1992)
Clustering rules: A comparison of partitioning and hierarchical
clustering algorithms;
*Journal of Mathematical Modelling and Algorithms* **5**,
475–504 (http://dx.doi.org/10.1007/s10852-005-9022-1).

`agnes`

for background and references;
`pam.object`

, `clara`

, `daisy`

,
`partition.object`

, `plot.partition`

,
`dist`

.

## generate 25 objects, divided into 2 clusters. x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)), cbind(rnorm(15,5,0.5), rnorm(15,5,0.5))) pamx <- pam(x, 2) pamx # Medoids: '7' and '25' ... summary(pamx) plot(pamx) ## use obs. 1 & 16 as starting medoids -- same result (typically) (p2m <- pam(x, 2, medoids = c(1,16))) ## no _build_ *and* no _swap_ phase: just cluster all obs. around (1, 16): p2.s <- pam(x, 2, medoids = c(1,16), do.swap = FALSE) p2.s p3m <- pam(x, 3, trace = 2) ## rather stupid initial medoids: (p3m. <- pam(x, 3, medoids = 3:1, trace = 1)) pam(daisy(x, metric = "manhattan"), 2, diss = TRUE) data(ruspini) ## Plot similar to Figure 4 in Stryuf et al (1996) ## Not run: plot(pam(ruspini, 4), ask = TRUE)

[Package *cluster* version 2.0.7-1 Index]