clara {cluster}  R Documentation 
Clustering Large Applications
Description
Computes a "clara" object, a list representing a clustering of the data into k clusters.
Usage
clara(x, k, metric = c("euclidean", "manhattan", "jaccard"),
stand = FALSE, cluster.only = FALSE, samples = 5,
sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
keep.data = medoids.x, rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
Arguments
x 
data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric (or logical). Missing values (NAs) are allowed. 
k 
integer, the number of clusters.
It is required that 0 < k < n, where n is the number of observations (i.e., n = nrow(x)).
metric 
character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean", "manhattan", "jaccard". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. 
stand 
logical, indicating if the measurements in x are standardized before calculating the dissimilarities. 
cluster.only 
logical; if true, only the clustering will be computed and returned, see details. 
samples 
integer, say N, the number of samples to be drawn from the dataset. 
sampsize 
integer, say j, the number of observations in each sample. 
trace 
integer indicating a trace level for diagnostic output during the algorithm. 
medoids.x 
logical indicating if the medoids should be returned, identically to some rows of the input data x. 
keep.data 
logical indicating if the (scaled if stand is true) data should be kept in the result. 
rngR 
logical indicating if R's random number generator should
be used instead of the primitive clara()-builtin one. If true, this
also means that each call to clara() may return a different result. 
pamLike 
logical indicating if the “swap” phase (see pam) should use the same algorithm as pam(). 
correct.d 
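logical or integer indicating that, only in the case
of NAs present in x, the correct distance computation should be used
instead of the older formula. Because the new correct formula is not
back compatible, for the time being, a warning is signalled in this
case, unless the user explicitly specifies correct.d. 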
Details
clara (for "euclidean" and "manhattan") is fully described in
chapter 3 of Kaufman and Rousseeuw (1990).
Compared to other partitioning methods such as pam, it can deal with
much larger datasets. Internally, this is achieved by considering
sub-datasets of fixed size (sampsize) such that the time and
storage requirements become linear in n rather than quadratic.
Each sub-dataset is partitioned into k clusters using the same
algorithm as in pam.
Once k representative objects have been selected from the
sub-dataset, each observation of the entire dataset is assigned
to the nearest medoid.
The mean (equivalent to the sum) of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The subdataset for which the mean (or sum) is minimal, is retained. A further analysis is carried out on the final partition.
Each sub-dataset is forced to contain the medoids obtained from the
best sub-dataset until then. Randomly drawn observations are added to
this set until sampsize has been reached.
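The procedure described above can be sketched in a few lines of R. This is only an illustration built on pam(), not the code that clara() runs: the helper name clara_sketch is hypothetical, only Euclidean distances are handled, and missing values, standardization and the built-in random number generator are all ignored.

```r
library(cluster)

## Hypothetical helper (not part of 'cluster'): a simplified CLARA-style loop.
clara_sketch <- function(x, k, samples = 5,
                         sampsize = min(nrow(x), 40 + 2 * k)) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(objective = Inf, medoids = integer(0), clustering = NULL)
  for (s in seq_len(samples)) {
    ## Keep the best medoids found so far, then fill up with random rows.
    rest <- setdiff(seq_len(n), best$medoids)
    need <- sampsize - length(best$medoids)
    idx  <- c(best$medoids, rest[sample.int(length(rest), need)])
    ## Partition the sub-dataset with pam(); id.med indexes rows of x[idx, ].
    med <- idx[pam(x[idx, , drop = FALSE], k)$id.med]
    ## Euclidean distance of every observation to each of the k medoids.
    d  <- sapply(med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
    cl <- max.col(-d, ties.method = "first")   # index of the nearest medoid
    ## Mean dissimilarity to the closest medoid measures the quality.
    objective <- mean(d[cbind(seq_len(n), cl)])
    if (objective < best$objective)
      best <- list(objective = objective, medoids = med, clustering = cl)
  }
  best
}

set.seed(1)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)
```

Only one sub-dataset of sampsize rows is passed to pam() at a time, which is why the pairwise dissimilarity computation stays small even when n is large.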
When cluster.only is true, the result is simply a (possibly
named) integer vector specifying the clustering, i.e.,
clara(x,k, cluster.only=TRUE)
is the same as
clara(x,k)$clustering
but computed more efficiently.
Value
If cluster.only is false (as by default),
an object of class "clara" representing the clustering. See
clara.object for details.
If cluster.only is true, the result is the "clustering", an
integer vector of length n with entries from 1:k.
Note
By default, the random sampling is implemented with a very
simple scheme (with period 2^16 = 65536) inside the Fortran
code, independently of R's random number generation, and as a matter
of fact, deterministically. Alternatively, we recommend setting
rngR = TRUE which uses R's random number generators. Then,
clara() results are made reproducible typically by using
set.seed() before calling clara.
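As a minimal sketch of the point above, the following compares repeated calls with and without rngR = TRUE on the xclara data set shipped with the package. The seed value 42 is arbitrary.

```r
library(cluster)
data(xclara)

## With rngR = TRUE the sampling follows R's generator, so resetting the
## seed before each call reproduces the same sub-datasets.
set.seed(42)
a <- clara(xclara, 3, rngR = TRUE)$clustering
set.seed(42)
b <- clara(xclara, 3, rngR = TRUE)$clustering
identical(a, b)

## The default built-in scheme ignores set.seed() and is deterministic,
## so two plain calls can be compared in the same way.
c1 <- clara(xclara, 3)$clustering
c2 <- clara(xclara, 3)$clustering
identical(c1, c2)
```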
The storage requirement of clara computation (for small k) is about
O(n \times p) + O(j^2), where j = sampsize, and (n,p) = dim(x).
The CPU computing time (again assuming small k) is about
O(n \times p \times j^2 \times N), where N = samples.
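To get a feel for these bounds, the terms can be evaluated for concrete sizes. This is a back-of-envelope sketch only: the variable names are illustrative, constants are ignored, and reading the j^2 term as the pairwise dissimilarities within one sample is an assumption.

```r
## Illustrative sizes: 3000 observations, 2 variables, 3 clusters.
n <- 3000; p <- 2; k <- 3
j <- min(n, 40 + 2 * k)        # default sampsize from the Usage section
N <- 5                         # default samples

data_cells <- n * p            # the O(n x p) storage term
diss_pairs <- j * (j - 1) / 2  # pairs within one sample, of order j^2
cpu_units  <- n * p * j^2 * N  # the quoted CPU bound, up to a constant

c(j = j, data_cells = data_cells, diss_pairs = diss_pairs,
  cpu_units = cpu_units)

## Doubling sampsize multiplies the CPU bound by four.
(n * p * (2 * j)^2 * N) / cpu_units
```

Since the bound grows with j^2 but only linearly with N, raising samples is the cheaper of the two ways to spend more computing time.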
For “small” datasets, the function pam can be used
directly. What can be considered small, is really a function
of available computing power, both memory (RAM) and speed.
Originally (1990), “small” meant less than 100 observations;
in 1997, the authors said “small (say with fewer than 200
observations)”; as of 2006, you can use pam with
several thousand observations.
Author(s)
Kaufman and Rousseeuw (see agnes), originally.
Metric "jaccard": Kamil Kozlowski (@ownedoutcomes.com)
and Kamil Jadeszko.
All arguments from trace on, and most R documentation and all
tests by Martin Maechler.
See Also
agnes for background and references;
clara.object, pam, partition.object, plot.partition.
Examples
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
clarax$clusinfo
## using pamLike=TRUE gives the same (apart from the 'call'):
all.equal(clarax[-8],
          clara(x, 2, samples=50, pamLike = TRUE)[-8])
plot(clarax)
## cluster.only = TRUE -- save some memory/time :
clclus <- clara(x, 2, samples=50, cluster.only = TRUE)
stopifnot(identical(clclus, clarax$clustering))
## 'xclara' is an artificial data set with 3 clusters of 1000 bivariate
## objects each.
data(xclara)
(clx3 <- clara(xclara, 3))
## "better" number of samples
cl.3 <- clara(xclara, 3, samples=100)
## but that did not change the result here:
stopifnot(cl.3$clustering == clx3$clustering)
## Plot similar to Figure 5 in Struyf et al (1996)
## Not run: plot(clx3, ask = TRUE)
## Try 100 times *different* random samples -- for reliability:
nSim <- 100
nCl <- 3 # = no.classes
set.seed(421)# (reproducibility)
cl <- matrix(NA,nrow(xclara), nSim)
for(i in 1:nSim)
   cl[,i] <- clara(xclara, nCl, medoids.x = FALSE, rngR = TRUE)$cluster
tcl <- apply(cl,1, tabulate, nbins = nCl)
## those that are not always in same cluster (5 out of 3000 for this seed):
(iDoubt <- which(apply(tcl,2, function(n) all(n < nSim))))
if(length(iDoubt)) { # (not for all seeds)
  tabD <- tcl[,iDoubt, drop=FALSE]
  dimnames(tabD) <- list(cluster = paste(1:nCl), obs = format(iDoubt))
  t(tabD) # how many times in which clusters
}