[R] clustering on scaled dataset or not?
cbeleites at units.it
Thu Oct 28 23:23:34 CEST 2010
> Hi, just a general question: when we do hierarchical clustering, should we
> compute the dissimilarity matrix based on the scaled or the non-scaled dataset?
> daisy() in the cluster package allows standardizing the variables before
> calculating the dissimilarity matrix;
I'd say that should depend on your data.
- If your variates are all (physically) different kinds of things (and thus
of different orders of magnitude), then you should probably scale.
- On the other hand, I cluster spectra. Thus my variates all have the
same unit, and moreover I'd be afraid that scaling would blow up
noise-only variates (i.e. the spectra do have regions of low or no
intensity), thus I usually don't scale.
- It also depends on your distance. E.g. the Mahalanobis distance should do
the scaling by itself, if I think correctly at this time of the day...
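
The point about the Mahalanobis distance can be checked with a small sketch. This is only an illustration under assumptions: the data are simulated, and mahal_dist() is a hypothetical helper (not a base R function) that computes pairwise Mahalanobis distances as Euclidean distances after whitening with the sample covariance.

```r
set.seed(1)
# Simulated data: two variates on very different scales (assumption)
x <- cbind(a = rnorm(20), b = rnorm(20, sd = 1000))

# Hypothetical helper: pairwise Mahalanobis distances, computed as
# Euclidean distances after whitening with the sample covariance
mahal_dist <- function(m) {
  m <- as.matrix(m)
  r <- chol(cov(m))      # cov(m) = t(r) %*% r
  dist(m %*% solve(r))   # whitened data, then Euclidean distance
}

d_raw    <- mahal_dist(x)
d_scaled <- mahal_dist(scale(x))

# Compare the two sets of distances
all.equal(as.vector(d_raw), as.vector(d_scaled))

# Plain Euclidean distances, in contrast, do change with scaling
isTRUE(all.equal(as.vector(dist(x)), as.vector(dist(scale(x)))))
```

The whitening step absorbs any per-variate rescaling, which is why scale() beforehand should make no difference for this distance, while it does for the Euclidean one.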
What I do frequently, though, is subtract something like the minimum
spectrum (in practice, I calculate the 5th percentile for each variate -
it's less noisy). You can also center, but I'm strongly in favour of having
a physical meaning, and for my samples the minimum spectrum is
easier to interpret (it represents the matrix composition).
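
A minimal sketch of that percentile subtraction, assuming the spectra are stored as rows of a matrix with one column per variate. The matrix here is simulated and the object names are made up for illustration.

```r
set.seed(42)
# Simulated stand-in for real data: 50 spectra (rows), 8 variates (columns)
spectra <- matrix(rexp(50 * 8), nrow = 50)

# 5th percentile of each variate, used as a less noisy "minimum spectrum"
baseline <- apply(spectra, 2, quantile, probs = 0.05)

# Subtract the baseline from every spectrum (column-wise sweep)
corrected <- sweep(spectra, 2, baseline, "-")

# Cluster the corrected, unscaled spectra
hc <- hclust(dist(corrected))
```

Note that subtracting the same spectrum from every row does not change the Euclidean distances between spectra, so here it matters for interpreting the data rather than for the clustering itself.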
> but dist() doesn't have that option at all. Appreciate if
> you can share your thoughts?
But you could call scale() and then dist().
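
For example, a sketch using the built-in mtcars data (the choice of dataset, columns, and k = 3 is only for illustration):

```r
# Four variates with different units and magnitudes
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

d_raw    <- dist(x)          # distances on the original scale
d_scaled <- dist(scale(x))   # center and scale each column, then distances

hc_raw    <- hclust(d_raw)
hc_scaled <- hclust(d_scaled)

# Cross-tabulate the 3-cluster solutions to see how much scaling matters
table(raw = cutree(hc_raw, k = 3), scaled = cutree(hc_scaled, k = 3))
```

Note that scale() divides by the standard deviation, whereas daisy(stand = TRUE) divides by the mean absolute deviation, so the two routes should not be expected to give identical dissimilarities.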