[R] distance in the function kmeans
Jari Oksanen
jari.oksanen at oulu.fi
Sat May 29 07:53:11 CEST 2004
This message breaks the thread because I am writing at home, and there
were no new messages on this subject after I got home. I hope this still
reaches interested parties.
There are several methods that find centroids (means) from distance
data. Centroid clustering methods do so, and so does classic scaling
a.k.a. metric multidimensional scaling a.k.a. principal co-ordinates
analysis (in the R function cmdscale the means are found by the C code in
dblcen.c in the R sources). Strictly, this centroid finding only works with
Euclidean distances, but these methods willingly handle any other
dissimilarities (or distances). Sometimes this results in anomalies
like upper levels being below lower levels in cluster diagrams or in
negative eigenvalues in cmdscale. In principle, kmeans could do the
same if she only wanted.
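
To make the centroid finding concrete, here is a minimal sketch (not the
actual dblcen.c code). The data x and the toy dissimilarity d3 are made
up for illustration. It double-centres squared distances by hand, checks
that this gives squared distances to the centroid without using the
coordinates, and then shows a negative eigenvalue for a dissimilarity
that violates the triangle inequality:

```r
# Hypothetical data: 10 observations on 4 variables
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)
d <- as.matrix(dist(x))

# Double centring of -d^2/2, written out with a centring matrix J
n <- nrow(d)
J <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * J %*% d^2 %*% J

# The diagonal of B holds squared distances to the centroid,
# obtained from the distances alone
xc <- scale(x, center = TRUE, scale = FALSE)
centroid_ok <- isTRUE(all.equal(diag(B), rowSums(xc^2)))

# A semimetric toy case: d(1,3) > d(1,2) + d(2,3)
d3 <- as.dist(matrix(c(0, 1, 3,
                       1, 0, 1,
                       3, 1, 0), nrow = 3))
ev <- cmdscale(d3, k = 1, eig = TRUE)$eig
print(ev)   # one eigenvalue is negative
```

With Euclidean distances centroid_ok is TRUE. The toy semimetric cannot
be embedded in any Euclidean space, which is what the negative eigenvalue
reports.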
Is it correct to use non-Euclidean dissimilarities when Euclidean
distances were assumed? In my field (ecology) we know that Euclidean
distances are often poor, and some other dissimilarities have better
properties, and I think it is OK to break the rules (or `violate the
assumptions'). Now we don't know what kind of dissimilarities were used
in the original post (I think I never saw this specified), so we don't
know if they can be euclidized directly using ideas of Petzold or
Simpson. They might be semimetric or other sinful dissimilarities, too.
These would be bad in the sense Uwe Ligges wrote: you wouldn't get
centres of Voronoi polygons in original space, not even non-overlapping
polygons. Still they might work better than the original space (who
wants to be in the original space when there are better spaces floating
around?)
The following trick handles the problem by euclidizing the space implied
by any dissimilarity measure (metric or semimetric). Here mdata is your
original (rectangular) data matrix, and dis is any dissimilarity data:
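# n - 1 is the largest k that cmdscale accepts, so no positive axis is lost
tmp <- cmdscale(dis, k = nrow(mdata) - 1, eig = TRUE)
# tmp$eig can be longer than ncol(tmp$points), so match the lengths first
eucspace <- tmp$points[, tmp$eig[seq_len(ncol(tmp$points))] > 0.01, drop = FALSE]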
The condition removes axes with negative or almost-zero eigenvalues
that you will get with semimetric dissimilarities.
Then just call kmeans with eucspace as the data argument. If your dis is
Euclidean, this is only a rotation, and kmeans on eucspace and on mdata
should give the same clustering (given the same starting centres). For
other types of dis (even for semimetric
dissimilarity) this maps your dissimilarities onto Euclidean space
which in effect is the same as performing kmeans with your original
dissimilarity.
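
As a rough check of the Euclidean case, here is a sketch under stated
assumptions: the data are simulated, the starting rows are arbitrary, and
k = nrow(mdata) - 1 is used so that no positive axis is dropped when
there are fewer variables than observations. The eigenvalue vector is
trimmed to the number of returned axes before filtering, because cmdscale
may return more eigenvalues than axes.

```r
# Hypothetical data: two well-separated groups, 15 rows each, 3 variables
set.seed(42)
mdata <- rbind(matrix(rnorm(45, mean = 0), ncol = 3),
               matrix(rnorm(45, mean = 10), ncol = 3))
dis <- dist(mdata)   # Euclidean, so the mapping should be a pure rotation

# cmdscale may warn that many of the requested eigenvalues are not
# positive; that is expected here, hence suppressWarnings()
tmp <- suppressWarnings(cmdscale(dis, k = nrow(mdata) - 1, eig = TRUE))
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
eucspace <- tmp$points[, keep, drop = FALSE]

# Same starting rows for both runs, so the results are comparable
start <- c(1, 16)
km_orig <- kmeans(mdata,    centers = mdata[start, ])
km_euc  <- kmeans(eucspace, centers = eucspace[start, , drop = FALSE])

# Distances are preserved, and so is the clustering
same_dist    <- isTRUE(all.equal(as.vector(dist(eucspace)), as.vector(dis)))
same_cluster <- all(km_orig$cluster == km_euc$cluster)
```

For a non-Euclidean dis the same calls apply, but eucspace then only
approximates dis to the extent that the dropped eigenvalues are small.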
Cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland