agnes { cluster }R Documentation

Agglomerative Nesting (Hierarchical Clustering)

Description

Computes agglomerative hierarchical clustering of the dataset.

Usage
agnes(x, diss = inherits(x, "dist"), metric = "euclidean",
      stand = FALSE, method = "average", par.method,
      keep.diss = n < 100, keep.data = !diss, trace.lev = 0)
Arguments
x

data matrix or data frame, or dissimilarity matrix, depending on the value of the < code >diss argument.

In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.

In case of a dissimilarity matrix, < code >x is typically the output of < code >daisy or < code >dist. Also a vector with length n*(n-1)/2 is allowed (where n is the number of observations), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (NAs) are not allowed.

diss

logical flag: if TRUE (default for < code >dist or < code >dissimilarity objects), then < code >x is assumed to be a dissimilarity matrix. If FALSE, then < code >x is treated as a matrix of observations by variables.

metric

character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are < code >"euclidean" and < code >"manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. If < code >x is already a dissimilarity matrix, then this argument will be ignored.

stand

logical flag: if TRUE, then the measurements in < code >x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If < code >x is already a dissimilarity matrix, then this argument will be ignored.

method

character string defining the clustering method. The six methods implemented are < code >"average" ([unweighted pair-]group [arithMetic] average method, aka ‘UPGMA’), < code >"single" (single linkage), < code >"complete" (complete linkage), < code >"ward" (Ward's method), < code >"weighted" (weighted average linkage, aka ‘WPGMA’), its generalization < code >"flexible" which uses (a constant version of) the Lance-Williams formula and the < code >par.method argument, and < code >"gaverage" a generalized < code >"average" aka “flexible UPGMA” method also using the Lance-Williams formula and < code >par.method.

The default is < code >"average".

par.method

If < code >method is < code >"flexible" or < code >"gaverage", a numeric vector of length 1, 3, or 4, (with a default for < code >"gaverage"), see in the details section.

keep.diss, keep.data

logicals indicating if the dissimilarities and/or input data < code >x should be kept in the result. Setting these to < code >FALSE can give much smaller results and hence even save memory allocation < em >time.

trace.lev

integer specifying a trace level for printing diagnostics during the algorithm. Default < code >0 does not print anything; higher values print increasingly more.

Details

< code >agnes is fully described in chapter 5 of Kaufman and Rousseeuw (1990). Compared to other agglomerative clustering methods such as < code >hclust, < code >agnes has the following features: (a) it yields the agglomerative coefficient (see < code >agnes.object) which measures the amount of clustering structure found; and (b) apart from the usual tree it also provides the banner, a novel graphical display (see < code >plot.agnes).

The < code >agnes-algorithm constructs a hierarchy of clusterings.
At first, each observation is a small cluster by itself. Clusters are merged until only one large cluster remains which contains all the observations. At each stage the two < em >nearest clusters are combined to form one larger cluster.

For < code >method="average", the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster.
In < code >method="single", we use the smallest dissimilarity between a point in the first cluster and a point in the second cluster (nearest neighbor method).
When < code >method="complete", we use the largest dissimilarity between a point in the first cluster and a point in the second cluster (furthest neighbor method).

The < code >method = "flexible" allows (and requires) more details: The Lance-Williams formula specifies how dissimilarities are computed when clusters are agglomerated (equation (32) in K&R(1990), p.237). If clusters C_1 and C_2 are agglomerated into a new cluster, the dissimilarity between their union and another cluster Q is given by

D(C_1 \cup C_2, Q) = α_1 * D(C_1, Q) + α_2 * D(C_2, Q) + β * D(C_1,C_2) + γ * |D(C_1, Q) - D(C_2, Q)|,

where the four coefficients (α_1, α_2, β, γ) are specified by the vector < code >par.method, either directly as vector of length 4, or (more conveniently) if < code >par.method is of length 1, say = α, < code >par.method is extended to give the “Flexible Strategy” (K&R(1990), p.236 f) with Lance-Williams coefficients (α_1 = α_2 = α, β = 1 - 2α, γ=0).
Also, if < code >length(par.method) == 3, γ = 0 is set.

< b >Care and expertise is probably needed when using < code >method = "flexible" particularly for the case when < code >par.method is specified of longer length than one. Since cluster version 2.0, choices leading to invalid < code >merge structures now signal an error (from the C code already). The < em >weighted average (< code >method="weighted") is the same as < code >method="flexible", par.method = 0.5. Further, < code >method= "single" is equivalent to < code >method="flexible", par.method = c(.5,.5,0,-.5), and < code >method="complete" is equivalent to < code >method="flexible", par.method = c(.5,.5,0,+.5).

The < code >method = "gaverage" is a generalization of < code >"average", aka “flexible UPGMA” method, and is (a generalization of the approach) detailed in Belbin et al. (1992). As < code >"flexible", it uses the Lance-Williams formula above for dissimilarity updating, but with α_1 and α_2 not constant, but < em >proportional to the < em >sizes n_1 and n_2 of the clusters C_1 and C_2 respectively, i.e,

α_j = α'_j * n_1/(n_1 + n_2),

where α'_1, α'_2 are determined from < code >par.method, either directly as (α_1, α_2, β, γ) or (α_1, α_2, β) with γ = 0, or (less flexibly, but more conveniently) as follows:

Belbin et al proposed “flexible beta”, i.e. the user would only specify β (as < code >par.method), sensibly in

-1 ≤ β < 1,

and β determines α'_1 and α'_2 as

α'_j = 1 - β,

and γ = 0.

This β may be specified by < code >par.method (as length 1 vector), and if < code >par.method is not specified, a default value of -0.1 is used, as Belbin et al recommend taking a β value around -0.1 as a general agglomerative hierarchical clustering strategy.

Note that < code >method = "gaverage", par.method = 0 (or < code >par.method = c(1,1,0,0)) is equivalent to the < code >agnes() default method < code >"average".

Value

an object of class < code >"agnes" (which extends < code >"twins") representing the clustering. See < code >agnes.object for details, and methods applicable.

BACKGROUND

Cluster analysis divides a dataset into groups (clusters) of observations that are similar to each other.

Hierarchical methods

like < code >agnes, < code >diana, and < code >mona construct a hierarchy of clusterings, with the number of clusters ranging from one to the number of observations.

Partitioning methods

like < code >pam, < code >clara, and < code >fanny require that the number of clusters be given by the user.

Author(s)

Method < code >"gaverage" has been contributed by Pierre Roudier, Landcare Research, New Zealand.

References

Kaufman, L. and Rousseeuw, P.J. (1990). (=: “K&R(1990)”) < em >Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

Anja Struyf, Mia Hubert and Peter J. Rousseeuw (1996) Clustering in an Object-Oriented Environment. < em >Journal of Statistical Software < b >1. http://www.jstatsoft.org/v01/i04

Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997). Integrating Robust Clustering Techniques in S-PLUS, < em >Computational Statistics and Data Analysis, < b >26, 17–37.

Lance, G.N., and W.T. Williams (1966). A General Theory of Classifactory Sorting Strategies, I. Hierarchical Systems. < em >Computer J. < b >9, 373–380.

Belbin, L., Faith, D.P. and Milligan, G.W. (1992). A Comparison of Two Approaches to Beta-Flexible Clustering. < em >Multivariate Behavioral Research, < b >27, 417–433.

See Also

< code >agnes.object, < code >daisy, < code >diana, < code >dist, < code >hclust, < code >plot.agnes, < code >twins.object.

Examples
data(votes.repub)
agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
agn1
plot(agn1)

op <- par(mfrow=c(2,2))
agn2 <- agnes(daisy(votes.repub), diss = TRUE, method = "complete")
plot(agn2)
## alpha = 0.625 ==> beta = -1/4  is "recommended" by some
agnS <- agnes(votes.repub, method = "flexible", par.meth = 0.625)
plot(agnS)
par(op)

## "show" equivalence of three "flexible" special cases
d.vr <- daisy(votes.repub)
a.wgt  <- agnes(d.vr, method = "weighted")
a.sing <- agnes(d.vr, method = "single")
a.comp <- agnes(d.vr, method = "complete")
iC <- -(6:7) # not using 'call' and 'method' for comparisons
stopifnot(
  all.equal(a.wgt [iC], agnes(d.vr, method="flexible", par.method = 0.5)[iC])   ,
  all.equal(a.sing[iC], agnes(d.vr, method="flex", par.method= c(.5,.5,0, -.5))[iC]),
  all.equal(a.comp[iC], agnes(d.vr, method="flex", par.method= c(.5,.5,0, +.5))[iC]))

## Exploring the dendrogram structure
(d2 <- as.dendrogram(agn2)) # two main branches
d2[[1]] # the first branch
d2[[2]] # the 2nd one  { 8 + 42  = 50 }
d2[[1]][[1]]# first sub-branch of branch 1 .. and shorter form
identical(d2[[c(1,1)]],
          d2[[1]][[1]])
## a "textual picture" of the dendrogram :
str(d2)

data(agriculture)

## Plot similar to Figure 7 in ref
## Not run: plot(agnes(agriculture), ask = TRUE)


data(animals)
aa.a  <- agnes(animals) # default method = "average"
aa.ga <- agnes(animals, method = "gaverage")
op <- par(mfcol=1:2, mgp=c(1.5, 0.6, 0), mar=c(.1+ c(4,3,2,1)),
          cex.main=0.8)
plot(aa.a,  which.plot = 2)
plot(aa.ga, which.plot = 2)
par(op)


## Show how "gaverage" is a "generalized average":
aa.ga.0 <- agnes(animals, method = "gaverage", par.method = 0)
stopifnot(all.equal(aa.ga.0[iC], aa.a[iC]))

[ Package cluster version 2.0.7-1 Index ]