[R] daisy() for gower distance calculation
Martin Maechler
maechler at stat.math.ethz.ch
Mon Nov 20 10:16:07 CET 2006
>>>>> "Tyler" == Tyler Smith <tyler.smith at mail.mcgill.ca>
>>>>> on Sun, 19 Nov 2006 23:47:15 -0400 writes:
Tyler> Gavin Simpson wrote:
>> vegdist in package vegan has Gower's distance, but all
>> variables have to be numeric.
>> If you want to use mixed data (numerics, factors,
>> binary), see ?daisy in package cluster.
Tyler> This is a little unclear. vegdist will handle regular
Tyler> quantitative variables as well as binary
Tyler> variables. This is not so much a feature of vegdist
Tyler> as of the Gower similarity, which treats binary and
Tyler> quantitative variables identically, since a simple
Tyler> matching coefficient produces the same similarity
Tyler> value as is produced by Gower's quantitative
Tyler> similarity function for a variable that can take only
Tyler> two values.
Tyler> Perhaps that's what you meant, and I just
Tyler> misunderstood you. Perhaps I'm wrong, and someone
Tyler> will correct me!
Two things, not really a correction:
- daisy() is in Recommended package cluster which is part of every
R installation, so why not try it first?
- daisy() was developed for and documented in the book by
Kaufman and Rousseeuw (1990). They strove to be more flexible
than Gower's original proposal, and I (as maintainer of the
'cluster' package) have further tweaked the daisy() implementation.
It allows missing values (NAs) and distinguishes, and hence lets
you specify, the following 6-7 types of variables:
  continuous: "interval-scaled", "ordratio", "logratio"
              (where the last one just means to work on log()ed variables)
  discrete:
     asymmetric binary "A"
     symmetric binary  "S"
     nominal           "N" - (unordered) factor
     ordered           "O" - ordered factor
where all but the "*ratio" and binary types are determined by
default from the variables in the data frame.
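
To make the idea of a mixed-type dissimilarity with missing values concrete, here is a minimal, language-neutral sketch in Python. It is not daisy()'s implementation: the function name, the "interval"/"nominal" labels and the use of None for NA are assumptions made for illustration. It follows the general form of Gower's coefficient, where each variable contributes a value in [0, 1] and a variable that is missing in either record is dropped from both the sum and the count.

```python
def gower_dissimilarity(x, y, kinds, ranges):
    """Gower-style dissimilarity between two records (illustrative only).

    kinds[k]  : "interval" or "nominal" (hypothetical labels)
    ranges[k] : max - min of variable k over the whole data set;
                only used for "interval" variables
    None marks a missing value (NA).
    """
    total = 0.0
    used = 0
    for xk, yk, kind, rng in zip(x, y, kinds, ranges):
        if xk is None or yk is None:
            continue  # a missing value removes this variable for this pair
        if kind == "interval":
            # range-normalised absolute difference, in [0, 1]
            d = abs(xk - yk) / rng if rng else 0.0
        elif kind == "nominal":
            # simple matching: 0 if equal, 1 otherwise
            d = 0.0 if xk == yk else 1.0
        else:
            raise ValueError("unknown variable kind: %r" % (kind,))
        total += d
        used += 1
    # None if the two records share no observed variable
    return total / used if used else None


a = (1.0, "red", 10.0)
b = (3.0, "blue", None)          # third variable is missing in b
kinds = ("interval", "nominal", "interval")
ranges = (4.0, None, 20.0)       # assumed data-set ranges
d_ab = gower_dissimilarity(a, b, kinds, ranges)
```

In this example only the first two variables are compared, giving (2/4 + 1) / 2 = 0.75.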
For binary variables, using "symmetric" is effectively the
same as using "interval scaled", and this is used by default,
but relying on that default now gives a warning to the user,
since the reference (and I) recommend that you *think* about
whether *a*symmetric binary would not be more appropriate {which
it is in many cases in today's applications}.
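
The difference between the two binary treatments can be sketched as follows. Again this is only an illustration in Python, not daisy() itself, and the function name and argument are hypothetical. For 0/1 variables the range is 1, so the interval-scaled contribution |x - y| / range coincides with simple matching (the "symmetric" case). The "asymmetric" case is assumed here to ignore variables where both records are 0, as in a Jaccard-type coefficient.

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two 0/1 vectors (illustrative only)."""
    total = 0.0
    used = 0
    for xk, yk in zip(x, y):
        if asymmetric and xk == 0 and yk == 0:
            continue  # joint absence is treated as uninformative
        # range of a 0/1 variable is 1, so this equals |x - y| / range
        total += abs(xk - yk)
        used += 1
    # None if no variable is left to compare
    return total / used if used else None


x = (1, 0, 0, 0, 1)
y = (1, 1, 0, 0, 0)
d_sym = binary_dissimilarity(x, y)                    # 2 mismatches / 5 variables
d_asym = binary_dissimilarity(x, y, asymmetric=True)  # 2 mismatches / 3 variables
```

Here the symmetric value is 2/5 = 0.4 and the asymmetric value is 2/3, because the two 0-0 pairs are dropped. With many rare "presence" indicators, the symmetric version lets shared absences dominate the result.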
Regards,
Martin Maechler