[R] best selection of covariates (for each individual)
rdiaz at cnio.es
Wed Jun 19 10:42:45 CEST 2002
This is not strictly R related (though I would implement the solution in R;
besides, this list is so helpful for these kinds of stats questions...).
I got a "strange" request from a colleague. He has a bunch of (approx. 25000)
subjects that belong to one of 12 possible classes. In addition, there are 8
covariates (factors) that can take as values either "absence" or "presence".
Some of the subjects only have one covariate with value "presence" (the other
covariates being absent), but many of the subjects have more than one
covariate with a value of "presence".
My colleague wants each subject with more than one "presence" covariate to
have one, and only one, covariate be "presence" (of course, the final
"present" covariate would belong to the original "present" covars. for that
subject): in other words, each subject would be characterized by only one
covariate. This "selection of covariates for each subject" (or eliminating
covariates for each subject) has to be done in a way that maximizes the
correct classification of class based on the presence/absence of covariates.
(His reason for doing this is that this simplifies further analyses and
decision-making; I tried to explain that with 12 classes and 8 covariates
where each subject only has one "presence" covar we would not be able to do a
great job predicting class membership, but he insists the
one-covar-per-subject is essential).
I thought about a couple of approaches (see details below) but none seem very
satisfactory. This issue keeps reminding me of things such as the LASSO and
other shrinkage methods, but the twist here is that it is not the beta for a
covariate, but different covars in each subject which are made zero.
Is there any obvious solution I am missing? Any suggestions?
Approach 1: the final statistic to judge predictive quality is Goodman &
Kruskal's tau (or concentration coefficient) for IxJ contingency tables.
Since for every subject with m "present" covars, there are m possible
contingency tables, and there are many subjects with multiple present covars,
there is an astronomical number of possible contingency tables, and we
cannot do an exhaustive search (nor do I see an obvious way to simplify the
problem from tau's definition, because we have 12 categories to predict based
on the 8 covars). I would use a genetic algorithm to try and find a decent
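
To make the objective in Approach 1 concrete, here is a minimal sketch (in Python, purely for illustration; all function names are hypothetical and not taken from any package). It computes Goodman & Kruskal's tau for predicting class from the single covariate retained per subject, counts the number of candidate assignments, and does a brute-force search that is only feasible for toy data. It assumes each subject is described by the tuple of indices of its "present" covariates.

```python
from collections import Counter
from itertools import product
from math import prod

def gk_tau(chosen, classes):
    """Goodman & Kruskal's tau for predicting `classes` from `chosen`.

    tau = (sum_ij p_ij^2 / p_i+  -  sum_j p_+j^2) / (1 - sum_j p_+j^2)
    """
    n = len(classes)
    cell = Counter(zip(chosen, classes))   # n_ij
    row = Counter(chosen)                  # n_i+ (retained covariate)
    col = Counter(classes)                 # n_+j (class)
    within = sum(c * c / row[r] for (r, _), c in cell.items()) / n
    marginal = sum(c * c for c in col.values()) / (n * n)
    if marginal == 1.0:                    # only one class: tau undefined
        return 0.0
    return (within - marginal) / (1.0 - marginal)

def n_assignments(present_sets):
    """Number of ways to keep exactly one present covariate per subject."""
    return prod(len(s) for s in present_sets)

def best_assignment_bruteforce(present_sets, classes):
    """Exhaustive search; usable only when n_assignments() is tiny."""
    return max(product(*present_sets), key=lambda a: gk_tau(a, classes))

# Toy data: 4 subjects, covariates indexed 0 and 1.
present_sets = [(0, 1), (0,), (1,), (0, 1)]
classes = ["a", "a", "b", "b"]
best = best_assignment_bruteforce(present_sets, classes)
```

With the real data the product inside n_assignments() is far too large to enumerate, which is why a heuristic search such as a genetic algorithm would be needed; the same gk_tau() could serve as its fitness function.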
Approach 2: set this up as a multinomial loglinear model. Fit it (using
multinom) to the original data set. Do not code the covars as factors;
instead, code present as 1 and absent as 0.
For each subject with several (say, k) "present" covars, predict the class
membership (predict.multinom) for each of the k covar. vectors obtained by
subtracting, say, 0.1 from every non-zero covariate except one (a different
one is left untouched in each vector). Set as the new covariate vector for
that subject the one that gives the highest predicted probability to the
right class.
Repeat the model fitting and modify covariates as in the last step
(rescaling at the end, so that the max. covar. value is always one for each
subject) until there is only one non-zero covar. (If there ever is!).
This seems to me like a very clumsy approach, and I am not sure if there is
any reason for it to arrive at a reasonable solution; I thought it could be a
way of smoothly moving, within subject, each covariate (except one) "along
its path of least resistance" to a value of zero.
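
The per-subject update in Approach 2 could be sketched roughly as follows (Python, for illustration only). The names candidate_vectors, rescale and update_subject are hypothetical, and predict_proba is an assumed stand-in for whatever returns class probabilities from the currently fitted model (in R, something like predict on a multinom fit with type = "probs"). I am also assuming that the rescaling is done after every update and that values are floored at zero.

```python
def candidate_vectors(x, step=0.1):
    """One candidate per non-zero covariate: that covariate is left alone,
    every other non-zero covariate is reduced by `step` (floored at 0)."""
    nonzero = [j for j, v in enumerate(x) if v > 0]
    return [
        [max(v - step, 0.0) if (j != keep and v > 0) else v
         for j, v in enumerate(x)]
        for keep in nonzero
    ]

def rescale(x):
    """Rescale so that the largest covariate value is one."""
    m = max(x)
    return [v / m for v in x] if m > 0 else list(x)

def update_subject(x, true_class, predict_proba, step=0.1):
    """Move a subject to the candidate vector that gives the highest
    predicted probability to its true class."""
    if sum(1 for v in x if v > 0) <= 1:
        return list(x)                     # already a single covariate
    best = max(candidate_vectors(x, step),
               key=lambda c: predict_proba(c)[true_class])
    return rescale(best)

# Toy stand-in for the fitted model (NOT a multinomial fit): the
# probability of class "a" grows with the weight on covariate 0.
def toy_predict(c):
    p = c[0] / sum(c)
    return {"a": p, "b": 1.0 - p}

new_x = update_subject([1.0, 0.0, 1.0, 0.0], "a", toy_predict)
```

The outer loop (refit the model on the modified covariates, update every multi-covariate subject, repeat) is not shown; whether it ever ends with a single non-zero covariate per subject is exactly the open question above.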
(Note: in both approaches further simplification can be achieved by applying
the same transformation or mutation ---with ga--- to all subjects that belong
to the same class and have the same initial configuration of covariates. This
way I also forcefully prevent identical subjects from ending up with different
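
The grouping mentioned in the note could be sketched like this (Python, for illustration; group_subjects and expand are hypothetical names). Subjects are keyed by class and initial covariate configuration, a single decision is made per group, and that decision is then copied back to every member. With 8 binary covariates and 12 classes, and assuming every subject has at least one "presence", there are at most 12 x (2^8 - 1 - 8) = 2964 groups that need a decision, instead of one decision per subject.

```python
from collections import defaultdict

def group_subjects(covars, classes):
    """Map (class, covariate configuration) -> list of subject indices."""
    groups = defaultdict(list)
    for i, (x, y) in enumerate(zip(covars, classes)):
        groups[(y, tuple(x))].append(i)
    return dict(groups)

def expand(groups, choice_by_group, n):
    """Copy each group's retained covariate back to all of its members."""
    chosen = [None] * n
    for key, members in groups.items():
        for i in members:
            chosen[i] = choice_by_group[key]
    return chosen

# Toy data: 3 covariates, 4 subjects.
covars = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (1, 1, 0)]
classes = ["a", "a", "a", "b"]
groups = group_subjects(covars, classes)

# One (arbitrary, illustrative) decision per group: index of the covariate kept.
choice = {("a", (1, 1, 0)): 1, ("a", (1, 0, 0)): 0, ("b", (1, 1, 0)): 0}
chosen = expand(groups, choice, len(covars))
```

Either search (the ga or the iterative refitting) would then operate on the keys of groups rather than on individual subjects.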
Unidad de Bioinformática
Centro Nacional de Investigaciones Oncológicas (CNIO)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)