Greetings, listserv members:

I am involved in an analysis of the factors that predict whether a person who dies in a hospital becomes a transplant organ donor.  To do this analysis, with the help of the NCHS we have linked the list of all organ donors over a seven-year period with death-certificate information on all US deaths over the same period.  As you might imagine, this is a rather "big data" analysis, with nearly 40,000 donors among about 2,500,000 deaths.

The death certificates also list a very large number of ICD-9 codes (and other information).  We anticipate that we will need to reduce the dimensionality of the problem for it to become practical, let alone intelligible, and we are planning to use the grpreg package in R to do a two-level selection of the most relevant covariates.  But our data also have a nested geographical structure: US counties within the designated service areas of the organ procurement organizations (OPOs).  I am not aware of a package that deals simultaneously with covariate selection (a la glmnet or similar packages) and mixed modeling.  I am addressing this e-mail to you all as folks who are expert in mixed models.

I have read that in fitting a mixed model, one first fits the fixed effects and then looks for additional explanatory structure among the random effects.  This has suggested to me that one could approach the above problem in two steps: first reduce the dimensionality of the problem and derive coefficients from a glmnet-type analysis, and then do a mixed model analysis on the residuals from that first fit.
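
To make the two-step idea concrete, here is a minimal sketch on simulated data, not our real data.  It rests on several assumptions: it uses glmnet rather than grpreg for the selection step, all variable names (opo, county, X, eta) are invented for illustration, and, because the outcome is binary, it carries the step-one linear predictor into the mixed model as an offset rather than modeling residuals directly.  That last choice is one possible reading of "mixed model on the residuals", not an established recipe.

```r
library(glmnet)  # penalized regression (step 1)
library(lme4)    # mixed models (step 2)

set.seed(1)

# --- Simulated stand-in for the real data (all names hypothetical) ---
n <- 5000; p <- 50
n_opo <- 10; n_cty_per_opo <- 5
opo_id    <- sample(seq_len(n_opo), n, replace = TRUE)
county_id <- (opo_id - 1) * n_cty_per_opo +
  sample(seq_len(n_cty_per_opo), n, replace = TRUE)  # counties nested in OPOs

X <- matrix(rbinom(n * p, 1, 0.1), n, p)   # binary indicators, like ICD-9 flags
beta <- c(rep(1, 5), rep(0, p - 5))        # only the first 5 covariates matter
u_opo    <- rnorm(n_opo, sd = 0.5)
u_county <- rnorm(n_opo * n_cty_per_opo, sd = 0.3)
lin <- -3 + drop(X %*% beta) + u_opo[opo_id] + u_county[county_id]
y <- rbinom(n, 1, plogis(lin))

# --- Step 1: lasso-penalized logistic regression, ignoring the grouping ---
cv_fit <- cv.glmnet(X, y, family = "binomial")
# Linear predictor (log-odds scale) at the cross-validated lambda
eta <- drop(predict(cv_fit, newx = X, s = "lambda.1se", type = "link"))

# --- Step 2: random intercepts for OPO and county-within-OPO,
#     with the step-1 linear predictor held fixed as an offset ---
d <- data.frame(y = y, eta = eta,
                opo = factor(opo_id), county = factor(county_id))
mm_fit <- glmer(y ~ 1 + offset(eta) + (1 | opo / county),
                data = d, family = binomial)

print(VarCorr(mm_fit))  # estimated variance components
```

In this sketch the second step treats the selected coefficients as known, so it would not propagate any uncertainty from the selection step into the variance components.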

So the basic question is whether something along the above lines makes sense.  I would deeply appreciate any suggestions or pointers to relevant literature that I could use to understand all this better.

Many thanks in advance for your help.

