[R] algorithm for clustering categorical data
Martin Maechler
maechler at stat.math.ethz.ch
Tue Aug 6 18:04:11 CEST 2013
>>>>> "DC" == David Carlson <dcarlson at tamu.edu>
>>>>> on Tue, 6 Aug 2013 10:26:56 -0500 writes:
> What do you mean by representing the categorical fields by 1:k?
> a <- c("red", "green", "blue", "orange", "yellow")
> becomes
> a <- c(1, 2, 3, 4, 5)
> That guarantees your results are worthless
worthless indeed!
> unless your categories
> have an inherent order (e.g. tiny, small, medium, big, giant).
> Otherwise it should be four (k-1) indicator/dummy variables (e.g.):
> a.red <- c(1, 0, 0, 0, 0)
> a.green <- c(0, 1, 0, 0, 0)
> a.blue <- c(0, 0, 1, 0, 0)
> a.orange <- c(0, 0, 0, 1, 0)
> Then you can use Euclidean distance.
Yes, ... or use Gower's or other similarly sophisticated
distances, as you (David) mentioned earlier in this thread.
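
To make the point about 1:k coding concrete, here is a minimal sketch using only base R (dist() and model.matrix()). The object names (d.int, X.k, d.k, d.k1) are made up for illustration. It also shows one subtlety: with only k-1 indicators the omitted category ends up at distance 1 from every other category, while those are sqrt(2) apart from each other; keeping all k indicators makes every pair equidistant.

```r
## Five unordered categories, one observation per category
lev <- c("red", "green", "blue", "orange", "yellow")
a   <- factor(lev, levels = lev)

## (1) Integer coding 1:k: distances just reflect the arbitrary level order
d.int <- as.matrix(dist(as.integer(a)))
dimnames(d.int) <- list(lev, lev)
d.int["red", "green"]    # neighbours in the coding
d.int["red", "yellow"]   # far apart, for no substantive reason

## (2) Indicator coding with all k columns (no intercept)
X.k <- model.matrix(~ a - 1)
d.k <- as.matrix(dist(X.k))
dimnames(d.k) <- list(lev, lev)
range(d.k[upper.tri(d.k)])   # every pair is equally far apart

## (3) Indicator coding with k-1 columns ("yellow" dropped, as in the
##     a.red ... a.orange example above)
X.k1 <- X.k[, colnames(X.k) != "ayellow"]
d.k1 <- as.matrix(dist(X.k1))
dimnames(d.k1) <- list(lev, lev)
d.k1["red", "green"]     # two non-reference categories
d.k1["red", "yellow"]    # the dropped category is closer to all others
```

Whether the k or the k-1 version is preferable depends on what you do next; for plain Euclidean distances the full set treats all categories alike.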
Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster' :
require("cluster")
?daisy
## notably daisy(*, metric="gower")
Note that daisy() is more sophisticated than most users know:
its 'type = *' specification lets you declare, notably, binary
variables (such as your a.<col> dummies above) as asymmetric,
which may be quite important in "rare event" and similar
cases.
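
As a rough illustration of the 'type' argument, here is a small sketch on made-up data (the data frame 'd' and its columns income, colour and rare are hypothetical, not from this thread). It computes Gower dissimilarities twice, once treating the binary column as symmetric and once as asymmetric, and then passes the result to pam(), also from 'cluster', as one possible way to cluster on a dissimilarity; k = 2 is an arbitrary choice.

```r
library(cluster)

## Hypothetical toy data: a numeric variable on a large scale, an
## unordered factor, and a 0/1 flag marking a "rare event"
d <- data.frame(
  income = c(12000, 15000, 52000, 48000, 90000),
  colour = factor(c("red", "green", "red", "blue", "green")),
  rare   = c(0, 0, 1, 0, 1)
)

## Gower dissimilarity: numeric columns are scaled by their range,
## factors contribute 0 (same level) or 1 (different level)
g.sym  <- daisy(d, metric = "gower", type = list(symm  = "rare"))

## Same, but 'rare' is asymmetric binary: a joint absence (0, 0) is
## not counted as evidence that two rows are similar
g.asym <- daisy(d, metric = "gower", type = list(asymm = "rare"))

## Rows 1 and 2 are both 0 on 'rare'; compare the two versions
as.matrix(g.sym)[1, 2]
as.matrix(g.asym)[1, 2]

## One option for clustering a dissimilarity object: partitioning
## around medoids (k = 2 chosen arbitrarily for the sketch)
fit <- pam(g.asym, k = 2)
fit$clustering
```

Since k-means needs coordinates rather than a dissimilarity matrix, a medoid-based or hierarchical method is the natural partner for daisy() output.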
Martin
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> -----Original Message-----
> From: Li, Yan [mailto:Yan_Li at ibi.com]
> Sent: Tuesday, August 6, 2013 9:36 AM
> To: dcarlson at tamu.edu; r-help at r-project.org
> Subject: RE: [R] algorithm for clustering categorical data
> Hi David and other R helpers,
> If I rescale the numerical fields to [0,1] and represent the
> categorical fields as 1:k, which is the same starting point as
> Gower's measure, but then use Euclidean distance instead of Gower's
> distance to do k-means clustering, how much difference does it make?
> What is the drawback?
> Thank you,
> Yan
> -----Original Message-----
> From: David Carlson [mailto:dcarlson at tamu.edu]
> Sent: Thursday, August 01, 2013 12:08 PM
> To: Li, Yan; r-help at r-project.org
> Subject: RE: [R] algorithm for clustering categorical data
> Read up on Gower's Distance measures (available in the ecodist
> package) which can combine numeric and categorical data. You didn't
> give us any information about how you numerically transformed the
> categorical variables, but the usual approach is to create indicator
> variables that code presence/absence for each category within a
> categorical variable. Different variances between variables can be
> reduced by standardizing the variables.
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Li, Yan
> Sent: Thursday, August 1, 2013 11:00 AM
> To: r-help at r-project.org
> Subject: [R] algorithm for clustering categorical data
> Hi All,
> Does anyone know of an algorithm for clustering categorical
> variables? R packages? Which is the best?
> If a data set has both numeric and categorical variables, what is
> the best clustering algorithm and R package to use?
> I tried a numeric transformation of all categorical fields and
> clustering afterwards. But the transformed fields have values from
> 1...10, and my other fields are on a much larger scale:
> 10000-... This makes the categorical fields have less effect on
> the distance calculation...
> Thank you!
> Yan