[R] Hints for Data Clustering
Lorenzo Isella
lorenzo.isella at gmail.com
Fri Sep 2 18:50:04 CEST 2011
Dear All,
I will be confronted (relatively soon) with the following problem:
given a set of known statistical indicators {s_i}, i=1,2,...,N, for N
countries I would like to be able to do some data clustering i.e.
determining the best way to partition the N countries according to their
known properties, encoded by the {s_i} set of indicators for those
countries.
Some properties of these countries may be categorical or anyway
non-numerical variables (e.g. the fact of belonging/not belonging to a
certain group; joining/not joining a certain treaty, etc.). I have seen
some data clustering examples, but without categorical variables and I
wonder if this is an inherent limitation of the methodology (off the top
of my head, I would not know how to define the distance between
non-numerical variables). Any suggestions about the general methodology
and R packages/code snippets are really appreciated.
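To make the question concrete, here is a minimal sketch of one candidate approach for mixed data that I am not sure is the right tool: Gower's dissimilarity, computed by daisy() from the cluster package, followed by hierarchical clustering with hclust(). The data frame, its column names and its values are all made up for illustration.

```r
library(cluster)  # recommended package shipped with R; provides daisy()

# Hypothetical toy data: one numeric indicator and one categorical one.
countries <- data.frame(
  income = c(26, 36, 12, 15),
  treaty = factor(c("joined", "joined", "not_joined", "not_joined")),
  row.names = c("A", "B", "C", "D")
)

# Gower's dissimilarity handles numeric and factor columns together;
# each variable contributes a value between 0 and 1.
d <- daisy(countries, metric = "gower")

# Hierarchical clustering on the dissimilarities, cut into two groups.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Whether Gower's coefficient is a sensible notion of distance for indicators of this kind is exactly what I would like advice on.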
And also: do the units in which I express a statistical indicator play a
role? For instance: for 2 given countries I could have the average age
of the population, the average life expectancy and the average income
per year in thousands of dollars. This would give rise e.g. to
(40,72,26) and (44,75,36), but if I measure the average income in
dollars, then I would get (40,72,26000) (44,75,36000). Would the units
that I choose for an indicator affect the clustering results? They
should not, in my view, since the income does not change whichever way I
express it, but I am not sure about the algorithm results.
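A minimal sketch of how this could be checked, assuming plain Euclidean distances from dist() and standardisation with scale(). Rows A and B are the two countries above; row C is a hypothetical third country that I made up so that nearest neighbours can be compared.

```r
# A and B come from the example above; C is hypothetical.
x_thousands <- rbind(A = c(age = 40, life = 72, income = 26),
                     B = c(age = 44, life = 75, income = 36),
                     C = c(age = 60, life = 50, income = 27))

# Same data with income expressed in dollars instead of thousands.
x_dollars <- x_thousands
x_dollars[, "income"] <- x_dollars[, "income"] * 1000

# Euclidean distances on the raw values, in both unit choices.
d_thousands <- as.matrix(dist(x_thousands))
d_dollars   <- as.matrix(dist(x_dollars))
print(round(d_thousands, 2))
print(round(d_dollars, 2))

# Standardise each column (mean 0, standard deviation 1) before
# computing distances, then compare the two unit choices.
d_scaled_thousands <- as.matrix(dist(scale(x_thousands)))
d_scaled_dollars   <- as.matrix(dist(scale(x_dollars)))
same_after_scaling <- isTRUE(all.equal(d_scaled_thousands, d_scaled_dollars))
print(same_after_scaling)
```

With the raw values, the country closest to A is not the same in the two unit choices, because the income column dominates the distance once it is expressed in dollars. After scale() the two distance matrices agree, but I do not know whether standardising every indicator is the recommended practice before clustering.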
Many thanks
Lorenzo