[R] clustering on scaled dataset or not?

Claudia Beleites cbeleites at units.it
Thu Oct 28 23:23:34 CEST 2010


> Hi, just a general question: when we do hierarchical clustering, should we
> compute the dissimilarity matrix based on scaled dataset or non-scaled dataset?

> daisy() in the cluster package allows standardizing the variables before
> calculating the dissimilarity matrix;

I'd say that should depend on your data.

- If your variables are (physically) different kinds of quantities (and 
thus on different orders of magnitude), then you should probably scale.

- On the other hand, I cluster spectra. My variates are all in the same 
unit, and moreover I'd be afraid that scaling would blow up noise-only 
variates (the spectra do have regions of low or no intensity), so I 
usually don't scale.

- It also depends on your distance. E.g. Mahalanobis should do the 
scaling by itself, if I think correctly at this time of the day...
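A quick check on toy data (just an illustration, not your data) confirms
that recollection: the Mahalanobis distance is unchanged when you rescale
a variate, because the covariance matrix absorbs the scale.

```r
## Toy data with two variates on very different scales
set.seed(1)
x  <- cbind(a = rnorm(50), b = rnorm(50, sd = 100))

## Rescale one variate by a factor of 100
x2 <- x
x2[, "b"] <- x2[, "b"] / 100

## Squared Mahalanobis distances to the respective centers
d1 <- mahalanobis(x,  colMeans(x),  cov(x))
d2 <- mahalanobis(x2, colMeans(x2), cov(x2))

all.equal(d1, d2)  # TRUE: rescaling makes no difference
```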

What I do frequently, though, is subtract something like the minimum 
spectrum (in practice I calculate the 5th percentile for each variate - 
it's less noisy). You can also center, but I'm strongly for 
preprocessing with a physical meaning, and for my samples the minimum 
spectrum is better interpretable (it represents the matrix composition).
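In code that looks something like the sketch below (spc is a stand-in
matrix here, spectra in rows and variates in columns; with real data
you'd use your own spectra matrix):

```r
set.seed(2)
spc <- matrix(runif(20 * 100), nrow = 20)  # toy stand-in for 20 spectra

## Per-variate 5th percentile as a less noisy "minimum spectrum"
min.spc <- apply(spc, 2, quantile, probs = 0.05)

## Subtract it from every spectrum (sweep works column-wise here)
spc.corr <- sweep(spc, 2, min.spc)
```

After the subtraction, the 5th percentile of every corrected variate is
zero (up to floating point), which is the intended baseline.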

> but dist() doesn't have that option at all. Appreciate if
> you can share your thoughts?
but you could call scale() and then dist().
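For example, with the built-in USArrests data (note that scale()
standardizes to unit standard deviation, while daisy(..., stand = TRUE)
uses the mean absolute deviation, so the two won't agree exactly):

```r
x  <- scale(USArrests)            # center and scale to unit variance
d  <- dist(x)                     # Euclidean distances on scaled data
hc <- hclust(d, method = "average")
plot(hc)                          # dendrogram of the 50 states
```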


> Thanks
> John
