[R] non-uniqueness in cluster analysis

Christian Hennig fm3a004 at math.uni-hamburg.de
Wed Dec 3 16:19:42 CET 2003


Hi,

Brian Ripley already replied "don't use average linkage"... You
may think about k-medoid (pam) in package cluster instead.
However, often average linkage is not such a bad choice, and if you really
want to use it for your data, you may try the following:
Among the hierarchical methods, single linkage has the smallest problem
with equal distances, because possible agglomerations based on equal
distances between clusters are all carried out regardless of the order.
If at some step the smallest between cluster-distance 
is d(a,b)=d(a,c)<d(b,c), it may happen that a and b are merged first, or
a and c are merged first, but before merging anything else with distance
larger than d(a,b), a, b *and* c are merged. Thus, you have order
dependence only between the steps where you merge clusters with the same
distance, but not afterwards.

If your problem occurs only at a low level of agglomeration
(and you don't have
situations where d(a,b) and d(a,c) are small and d(b,c) is very large; I 
do not know if the triangle inequality holds for your data), you may do
some first steps with Single Linkage and then continue with average
linkage (I haven't thought about if this can be done in R without extra
effort). 

But if you have already observed that the averarge linkage outcome depends
critically (from the viewpoint of interpretation) on the order of points, 
then it seems that you are in an unstable situation, if you are able to
define a unique clustering or not.

Christian

On Wed, 3 Dec 2003, Bruno Giordano wrote:

> Hi,
> I'm clustering objects defined by categorical variables with a hierarchical
> algorithm - average linkage.
> My distance matrix (general dissimilarity coefficient) includes several
> distances with exactly the same values.
> As I see, a standard agglomerative procedure ignores this problems, simply
> selecting, above equal distances, the one that comes first.
> For this reason the analysis in output depends strongly on the orderings of
> the objects within the raw data matrix.
> Is there a standard procedure to deal with this?
> Thanks
>     Bruno
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
ich empfehle www.boag-online.de




More information about the R-help mailing list