[R] Hints for Data Mining
Lorenzo Isella
lorenzo.isella at gmail.com
Thu Sep 15 01:11:24 CEST 2011
Dear All,
I am recycling a previous email of mine where I asked some questions
about clustering mixed numerical/categorical data. This time I am more
into data mining. I am given a set of known statistical indexes {s_i},
i=1,2...N, for N countries. These indexes are in general both
numerical and categorical variables. For each country, I also have a
property x_i whose value is known, but that I also would like to be able
to predict correctly using a model. This is needed in order to assess
the importance of the various indexes in determining {x_i}.
There are two cases of interest:
(1) all the {x_i} are numerical variables, e.g. the average life expectancy
(2) all the {x_i} are categorical variables (e.g. the fact that the
country joins treaty A, B or C). This reminds me of discrete choice models.
Any suggestions about how to tackle these problems? In the past I used
mclust, but it is limited to all the {s_i} being numerical variables.
I saw an example of the use of glm for predicting binary variables
http://www.ats.ucla.edu/stat/R/dae/probit.htm
which may be relevant for (2). In general I know that some people use
Weka for this sort of task, but I wonder if I can use R to get a
decision tree and a confusion matrix and to be able to predict how the
{x_i} would change by varying the value of one statistical index.
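
To make the request concrete, here is a minimal sketch of that workflow for case (2), on made-up data. It assumes the rpart package (one of the recommended packages shipped with R); the data frame `countries`, the index names `gdp`, `health`, `region` and the response `treaty` are all hypothetical and only stand in for the real {s_i} and {x_i}.

```r
library(rpart)  # recommended package, normally installed together with R

set.seed(1)
n <- 200

# Hypothetical indexes: two numerical, one categorical (names are invented)
countries <- data.frame(
  gdp    = rnorm(n, mean = 20, sd = 5),
  health = runif(n, 0, 10),
  region = factor(sample(c("north", "south", "east"), n, replace = TRUE))
)

# Hypothetical categorical response: the treaty depends on gdp and region
score <- countries$gdp + ifelse(countries$region == "north", 5, 0) + rnorm(n)
countries$treaty <- cut(score, breaks = c(-Inf, 18, 24, Inf),
                        labels = c("A", "B", "C"))

# Classification tree on mixed numerical/categorical predictors
fit <- rpart(treaty ~ gdp + health + region, data = countries,
             method = "class")

# Confusion matrix (in-sample, so it is optimistic)
pred <- predict(fit, newdata = countries, type = "class")
conf_mat <- table(observed = countries$treaty, predicted = pred)
print(conf_mat)

# Rough ranking of the indexes as used by the tree
print(fit$variable.importance)

# Vary one index (gdp) for the first country, keeping the others fixed
grid <- countries[rep(1, 5), c("gdp", "health", "region")]
grid$gdp <- seq(10, 30, length.out = 5)
what_if <- predict(fit, newdata = grid, type = "class")
print(data.frame(gdp = grid$gdp, predicted = what_if))
```

For case (1) the same call with a numerical response and method = "anova" should give a regression tree instead. The confusion matrix above is computed on the training data, so a held-out set or cross-validation would be needed for an honest error estimate.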
Many thanks for your suggestions
Lorenzo