[R] standardization of values before call to pam() or clara()
Martin Maechler
maechler at stat.math.ethz.ch
Sat Jun 3 14:19:39 CEST 2006
>>>>> "Dylan" == Dylan Beaudette <dylan.beaudette at gmail.com>
>>>>> on Mon, 22 May 2006 17:33:47 -0700 writes:
Dylan> Greetings, Experimenting with the cluster package,
Dylan> and am starting to scratch my head in regards to the
Dylan> *best* way to standardize my data. Both functions can
Dylan> pre-standardize columns in a data frame. According to
Dylan> the manual:
Dylan> Measurements are standardized for each variable
Dylan> (column), by subtracting the variable's mean value
Dylan> and dividing by the variable's mean absolute
Dylan> deviation.
Dylan> This works well when input variables are all in the
Dylan> same units. When I include new variables with a
Dylan> different intrinsic range, the ones with the largest
Dylan> relative values tend to be _weighted_ more heavily. This is
Dylan> certainly not surprising, but complicates things.
Dylan> Does there exist a robust technique to effectively
Dylan> re-scale each of the variables, regardless of their
Dylan> intrinsic range to some set range, say [0,1]?
Dylan> I have tried dividing a variable by the maximum value
Dylan> of that variable, but I am not sure if this is
Dylan> statistically correct.
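A more usual standardization is accomplished by the
function -- guess what? -- scale()
By default it standardizes each column to mean 0 and standard deviation 1.
But you can use it as well to do a [0,1] scaling, by passing the
column minima as 'center' and the column ranges (max - min) as 'scale'.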
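As a rough sketch of both uses of scale(): the built-in iris data and the names x, z, mins, rng and x01 are only for illustration and are not from the original question. The [0,1] variant assumes no column is constant, since the range would then be zero.

```r
# Numeric columns of a built-in data set, used purely as an example
x <- as.matrix(iris[, 1:4])

# Default: each column gets mean 0 and standard deviation 1
z <- scale(x)

# [0,1] scaling: subtract the column minimum, divide by the column range
mins <- apply(x, 2, min)
rng  <- apply(x, 2, max) - mins   # assumes rng > 0 for every column
x01  <- scale(x, center = mins, scale = rng)

# Per-column minimum and maximum after rescaling
apply(x01, 2, range)
```

Either z or x01 can then be passed to pam() or clara() in place of the raw data.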
Note that you are very wise to think about the importance of
variable scaling / weighting for cluster analysis.
But people have been "here" before, and invented the much more
general notion of a distance/dissimilarity between observational
units.
--> function daisy() {in "cluster"} or dist() {from "stats"}
provide such dissimilarity objects.
These can be used as input for pam() as well (clara(), by
contrast, works on the data matrix itself),
and in constructing them you are much more flexible than trying
to find a proper scaling of your x-matrix.
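A minimal sketch of the dissimilarity route, assuming the cluster package is installed. The iris data and the choice of k = 3 are illustrative assumptions only.

```r
library(cluster)

# Numeric example data (illustrative only)
x <- iris[, 1:4]

# Euclidean dissimilarities on standardized columns; with stand = TRUE,
# daisy() subtracts the column mean and divides by the mean absolute deviation
d <- daisy(x, metric = "euclidean", stand = TRUE)

# pam() recognizes a dissimilarity object and clusters on it directly
fit <- pam(d, k = 3)

# Cluster sizes
table(fit$clustering)
```

A comparable object could be built with dist(scale(x)), which standardizes by the standard deviation instead.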
Note that daisy() in particular has been designed for computing
sensible dissimilarities for the case when the X-matrix has a
mix of continuous (e.g. "interval scaled") and
categorical (e.g. binary) variables.
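For the mixed-variable case, something along these lines may serve as a starting point. The data frame below is made up for illustration (its column names and values are hypothetical), and k = 2 is an arbitrary choice.

```r
library(cluster)

# Hypothetical mixed data: two continuous and two categorical variables
df <- data.frame(
  height = c(1.62, 1.75, 1.80, 1.68, 1.90, 1.58),
  weight = c(55, 72, 85, 60, 95, 50),
  smoker = factor(c("no", "yes", "no", "no", "yes", "no")),
  region = factor(c("N", "S", "S", "N", "E", "E"))
)

# Gower's coefficient handles mixed types; each variable's contribution
# is scaled to [0,1], so no manual rescaling is needed
d <- daisy(df, metric = "gower")

# Cluster on the dissimilarities
fit <- pam(d, k = 2)
fit$clustering
```

Because each variable contributes on a [0,1] scale, the weighting problem from differing intrinsic ranges is handled inside the dissimilarity rather than by rescaling the data by hand.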
I recommend you get a textbook on clustering, to read up more on
the subject.
Regards,
Martin Maechler, ETH Zurich
Dylan> Any ideas, thoughts would be greatly appreciated.
Dylan> Cheers,
Dylan> -- Dylan Beaudette Soils and Biogeochemistry Graduate
Dylan> Group University of California at Davis 530.754.7341