[R] Hints for Data Clustering

Lorenzo Isella lorenzo.isella at gmail.com
Fri Sep 2 18:50:04 CEST 2011


Dear All,
I will be confronted (relatively soon) with the following problem:
given a set of known statistical indicators {s_i}, i=1,2,...,N, for N
countries, I would like to do some data clustering, i.e.
determine the best way to partition the N countries according to their
known properties, encoded by the {s_i} set of indicators for those
countries.
Some properties of these countries may be categorical or anyway 
non-numerical variables (e.g. the fact of belonging/not belonging to a 
certain group; joining/not joining a certain treaty etc...). I have seen 
some data clustering examples, but without categorical variables and I 
wonder if this is an inherent limitation of the methodology (off the top
of my head, I would not know how to define the distance between
non-numerical variables). Any suggestions about the general methodology
and R packages/code snippets are really appreciated.
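
For concreteness, here is a minimal sketch of one possible way to handle mixed numerical and categorical indicators: Gower's dissimilarity, as computed by daisy() in the cluster package, followed by partitioning around medoids with pam(). The data frame, its column names, all the values and the choice of k = 2 are hypothetical and serve only as an illustration; this is not necessarily the best method for real country data.

```r
library(cluster)  # recommended package, normally shipped with R

# Hypothetical toy data: every value below is made up for illustration.
countries <- data.frame(
  avg_age   = c(40, 44, 29, 31, 42, 27),
  life_exp  = c(72, 75, 61, 63, 78, 58),
  income_k  = c(26, 36, 4, 6, 41, 3),      # income in thousands of dollars
  in_treaty = factor(c("yes", "yes", "no", "no", "yes", "no")),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Gower dissimilarity: each numeric column is divided by its range,
# factor columns contribute 0 (same level) or 1 (different level).
d <- daisy(countries, metric = "gower")

# Partitioning around medoids on the dissimilarity object.
# k = 2 is an assumption made for this toy example.
fit <- pam(d, k = 2, diss = TRUE)
groups <- fit$clustering
print(groups)
```

Since Gower's coefficient only needs a per-variable notion of "same or different" for categorical columns, the lack of a natural distance between non-numerical values need not be an obstacle, at least in this kind of approach.
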
And also: do the units in which I express a statistical indicator play a 
role? For instance: for 2 given countries I could have the average age 
of the population, the average life expectancy and the average income 
per year in thousands of dollars. This would give rise e.g. to 
(40,72,26) and (44,75,36), but if I measure the average income in
dollars, then I would get (40,72,26000) and (44,75,36000). Would the units
that I choose for an indicator affect the clustering results? They
should not, in my view, since the income does not change whichever way I 
express it, but I am not sure about the algorithm results.
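
To make the question about units concrete, here is a small sketch using only base R. The first two rows are the two countries from the example above; countries C and D are hypothetical and made up for illustration. It assumes Euclidean distance and complete-linkage hierarchical clustering, which is only one of many possible choices. With raw values the income column dominates the distance once it is expressed in dollars, whereas standardizing each column with scale() removes the dependence on units.

```r
# Rows A and B are the two countries from the example in the text;
# rows C and D are hypothetical values added for illustration.
thousands <- data.frame(
  avg_age  = c(40, 44, 20, 22),
  life_exp = c(72, 75, 50, 52),
  income   = c(26, 36, 27, 35),   # thousands of dollars
  row.names = c("A", "B", "C", "D")
)

# Same data, with income expressed in dollars instead.
dollars <- thousands
dollars$income <- thousands$income * 1000

# Helper (hypothetical name): complete-linkage clustering cut into 2 groups.
two_groups <- function(x) cutree(hclust(dist(x), method = "complete"), k = 2)

# Raw values: the partition depends on the units of the income column.
raw_thousands <- two_groups(thousands)
raw_dollars   <- two_groups(dollars)
print(raw_thousands)
print(raw_dollars)

# Standardized columns (mean 0, sd 1): the choice of units no longer matters.
std_thousands <- two_groups(scale(thousands))
std_dollars   <- two_groups(scale(dollars))
print(identical(std_thousands, std_dollars))
```

So, at least for distance-based methods applied to raw values, the units can matter, and some form of standardization (or a range-scaled measure such as Gower's) seems to be needed to avoid that.
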
Many thanks

Lorenzo


