[R] algorithm for clustering categorical data

David Carlson dcarlson at tamu.edu
Tue Aug 6 17:26:56 CEST 2013


What do you mean by representing the categorical fields by 1:k?

a <- c("red", "green", "blue", "orange", "yellow")

becomes

a <- c(1, 2, 3, 4, 5)

That guarantees your results are worthless unless your categories
have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise it should be four (k-1) indicator/dummy variables (e.g.):

a.red <- c(1, 0, 0, 0, 0)
a.green <- c(0, 1, 0, 0, 0)
a.blue <- c(0, 0, 1, 0, 0)
a.orange <- c(0, 0, 0, 1, 0)

Then you can use Euclidean distance.
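
To see what the two codings do to the distances, here is a minimal
sketch in base R using the colour vector from the example above. Using
model.matrix() to build the indicators and making "yellow" the
reference level are assumptions made for illustration, not the only
way to do it.

```r
a <- factor(c("red", "green", "blue", "orange", "yellow"),
            levels = c("red", "green", "blue", "orange", "yellow"))

# Integer coding 1:k: distances depend on the arbitrary level order
d.int <- dist(as.integer(a))

# k-1 indicator columns; "yellow" becomes the reference (all-zero) level
a2 <- relevel(a, ref = "yellow")
ind <- model.matrix(~ a2)[, -1]   # drop the intercept column
d.ind <- dist(ind)

round(as.matrix(d.int), 2)
round(as.matrix(d.ind), 2)
```

With the integer coding, red and yellow come out four times as far
apart as red and green, purely because of the order in which the
levels were listed. With the indicators, every pair of non-reference
colours is sqrt(2) apart. Note that the reference level (all zeros) is
at distance 1 from each of the others, so a full set of k indicators
may be preferable if all categories should be equidistant.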

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352


-----Original Message-----
From: Li, Yan [mailto:Yan_Li at ibi.com] 
Sent: Tuesday, August 6, 2013 9:36 AM
To: dcarlson at tamu.edu; r-help at r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Hi David and other R helpers,

Suppose I rescale the numerical fields to [0,1] and represent the
categorical fields as 1:k, which is the same starting point as
Gower's measure, but then use Euclidean distance instead of Gower's
distance to do k-means clustering. How large is the difference? What
is the drawback?

Thank you,
Yan

-----Original Message-----
From: David Carlson [mailto:dcarlson at tamu.edu] 
Sent: Thursday, August 01, 2013 12:08 PM
To: Li, Yan; r-help at r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Read up on Gower's Distance measures (available in the ecodist
package) which can combine numeric and categorical data. You didn't
give us any information about how you numerically transformed the
categorical variables, but the usual approach is to create indicator
variables that code presence/absence for each category within a
categorical variable. Different variances between variables can be
reduced by standardizing the variables.
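
A minimal sketch of clustering on Gower dissimilarities follows. It
assumes the daisy() function from the cluster package rather than
ecodist, because daisy() accepts a data frame with factor columns
directly. The toy data frame, the column names, and k = 2 are all
hypothetical.

```r
library(cluster)  # recommended package shipped with R

# Hypothetical toy data: one large-scale numeric field, one categorical
dat <- data.frame(
  income = c(12000, 15000, 48000, 52000, 13000, 50000),
  colour = factor(c("red", "red", "blue", "blue", "green", "blue"))
)

# Gower dissimilarity: numeric columns are range-scaled to [0, 1],
# factor columns contribute 0 (same level) or 1 (different level)
d.gower <- daisy(dat, metric = "gower")

# Partitioning around medoids on the dissimilarities
fit <- pam(d.gower, k = 2)   # k = 2 is an arbitrary choice for illustration
fit$clustering
```

Because each variable contributes a value between 0 and 1, the
large-scale numeric field no longer swamps the categorical one.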

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Li, Yan
Sent: Thursday, August 1, 2013 11:00 AM
To: r-help at r-project.org
Subject: [R] algorithm for clustering categorical data

Hi All,

Does anyone know of an algorithm for clustering categorical
variables? Which R packages implement it? Which is the best?

If a data set has both numeric and categorical variables, what is the
best clustering algorithm and R package to use?

I tried a numeric transformation of all categorical fields and then
clustering. But the transformed fields have values from
1...10, while my other fields are on a much larger scale:
10000-... This makes the categorical fields have less effect on
the distance calculation...

Thank you!
Yan


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


