[R] algorithm for clustering categorical data

Tue Aug 6 20:43:59 CEST 2013

Thanks for the reply...

For some reason, I need to keep Euclidean distance in the process...

-----Original Message-----
From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
Sent: Tuesday, August 06, 2013 12:04 PM
To: dcarlson at tamu.edu
Cc: Li, Yan; r-help at r-project.org
Subject: Re: [R] algorithm for clustering categorical data

>>>>> "DC" == David Carlson <dcarlson at tamu.edu>
>>>>>     on Tue, 6 Aug 2013 10:26:56 -0500 writes:

    > What do you mean by representing the categorical fields by 1:k?
    > a <- c("red", "green", "blue", "orange", "yellow")

    > becomes

    > a <- c(1, 2, 3, 4, 5)

    > That guarantees your results are worthless

worthless indeed!

    > unless your categories
    > have an inherent order (e.g. tiny, small, medium, big, giant).
    > Otherwise it should be four (k-1) indicator/dummy variables (e.g.):

    > a.red <- c(1, 0, 0, 0, 0)
    > a.green <- c(0, 1, 0, 0, 0)
    > a.blue <- c(0, 0, 1, 0, 0)
    > a.orange <- c(0, 0, 0, 1, 0)

    > Then you can use Euclidean distance.
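
A minimal sketch of the point above, using hypothetical toy data (the object names a_int, a_ind, d_int and d_ind are made up for illustration). It compares Euclidean distances under integer coding with those under indicator coding. Note that it keeps all k indicator columns rather than k-1: with k-1 columns the omitted level lies at distance 1 from every other level, while the remaining levels are sqrt(2) apart.

```r
# Hypothetical toy data: five observations of one unordered factor.
a <- factor(c("red", "green", "blue", "orange", "yellow"))

# Integer coding imposes an arbitrary order on the levels
# (the codes follow the alphabetical order of the level names).
a_int <- as.integer(a)
d_int <- dist(a_int)   # distances differ between pairs of colours

# Indicator coding: one 0/1 column per level
# ("- 1" drops the intercept, so all k levels get a column).
a_ind <- model.matrix(~ a - 1)
d_ind <- dist(a_ind)   # every pair of distinct colours is equally far apart

d_int
d_ind
```

With integer coding some colours end up "closer" than others purely because of how the levels happen to be numbered; with indicator coding all distinct colours are equidistant.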

Yes, ... or use Gower's or other similarly sophisticated distances, as you (David) mentioned earlier in this thread.

Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster' :

  require("cluster")
  ?daisy
  ## notably  daisy(*,  metric="gower")

Note that daisy() is more sophisticated than most users know: its 'type = *' specification allows, notably for binary variables (such as your a.<col> dummies above), asymmetric behavior, which may be quite important in "rare event" and similar cases.
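
As a hedged illustration of the daisy() call described above, here is a small sketch on hypothetical data (the data frame df and its column names are invented for the example). It marks one 0/1 column as asymmetric binary via 'type', and then clusters on the resulting dissimilarities with pam(), since k-means works on coordinates rather than on a dissimilarity matrix.

```r
library(cluster)

# Hypothetical mixed data: one numeric field, one factor,
# and one 0/1 "rare event" flag.
df <- data.frame(
  income = c(12000, 15000, 40000, 42000, 90000),
  colour = factor(c("red", "green", "red", "blue", "green")),
  rare   = c(0, 0, 1, 0, 1)
)

# Gower dissimilarity. 'rare' is declared asymmetric binary, so a
# joint absence (0, 0) is not counted as agreement between two rows.
d <- daisy(df, metric = "gower", type = list(asymm = "rare"))

# Cluster directly on the dissimilarities (partitioning around medoids).
fit <- pam(d, k = 2)
fit$clustering
```

The choice of k = 2 is arbitrary here; hclust(d) would be another way to use the same dissimilarity object.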

Martin

    > -------------------------------------
    > David L Carlson
    > Associate Professor of Anthropology
    > Texas A&M University
    > College Station, TX 77840-4352

    > -----Original Message-----
    > From: Li, Yan [mailto:Yan_Li at ibi.com] 
    > Sent: Tuesday, August 6, 2013 9:36 AM
    > To: dcarlson at tamu.edu; r-help at r-project.org
    > Subject: RE: [R] algorithm for clustering categorical data

    > Hi David and other R helpers,

    > If I rescale the numerical fields to [0,1] and represent the
    > categorical fields as 1:k, which is the same starting point as
    > Gower's measure, but use Euclidean distance instead of Gower's
    > distance to do k-means clustering, how much difference does it
    > make? What is the drawback?

    > Thank you,
    > Yan

    > -----Original Message-----
    > From: David Carlson [mailto:dcarlson at tamu.edu] 
    > Sent: Thursday, August 01, 2013 12:08 PM
    > To: Li, Yan; r-help at r-project.org
    > Subject: RE: [R] algorithm for clustering categorical data

    > Read up on Gower's Distance measures (available in the ecodist
    > package) which can combine numeric and categorical data. You didn't
    > give us any information about how you numerically transformed the
    > categorical variables, but the usual approach is to create indicator
    > variables that code presence/absence for each category within a
    > categorical variable. Different variances between variables can be
    > reduced by standardizing the variables.

    > -------------------------------------
    > David L Carlson
    > Associate Professor of Anthropology
    > Texas A&M University
    > College Station, TX 77840-4352

    > -----Original Message-----
    > From: r-help-bounces at r-project.org
    > [mailto:r-help-bounces at r-project.org] On Behalf Of Li, Yan
    > Sent: Thursday, August 1, 2013 11:00 AM
    > To: r-help at r-project.org
    > Subject: [R] algorithm for clustering categorical data

    > Hi All,

    > Does anyone know which algorithms are suitable for clustering
    > categorical variables? Which R packages? Which is the best?

    > If a data set has both numeric and categorical fields, what is the
    > best clustering algorithm and R package to use?

    > I tried a numeric transformation of all categorical fields and
    > clustering afterwards. But the transformed fields have values from
    > 1...10, and my other fields are on a much bigger scale:
    > 10000-... This makes the categorical fields have less effect on
    > the distance calculation...

    > Thank you!
    > Yan
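
One way to address the scale problem raised in the original question while keeping Euclidean distance is sketched below on hypothetical data (df, range01 and x are illustrative names, not from the thread). The numeric field is rescaled to [0, 1] by its range and the factor is expanded into 0/1 indicator columns before calling kmeans(). This is only a sketch under those assumptions, not a recommendation over Gower's distance.

```r
# Hypothetical data: one large-scale numeric field and one factor.
df <- data.frame(
  amount = c(10000, 12000, 50000, 52000, 51000, 11000),
  colour = factor(c("red", "red", "blue", "blue", "green", "green"))
)

# Rescale a numeric vector to [0, 1] by its range
# (assumes the vector is not constant, otherwise this divides by zero).
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Combine the rescaled numeric column with one 0/1 column per level.
x <- cbind(amount = range01(df$amount),
           model.matrix(~ colour - 1, data = df))

set.seed(1)   # k-means starts from random centres
km <- kmeans(x, centers = 2, nstart = 10)
km$cluster
```

After rescaling, every column lies in [0, 1], so the numeric field no longer dominates the distance calculation simply because of its units.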