[R-sig-ME] Best way to handle missing data?
Joseph Bulbulia
joseph.bulbulia at me.com
Mon Mar 2 13:03:49 CET 2015
RELATED QUESTION
I have a related and probably naive question, but raising it might be helpful to Bonnie and others (myself included) who are struggling with multiple imputation in a mixed-effects modeling setting.
FIRST, MY DISCOMFORT
The question arises from (1) my discomfort with averaging across multiply imputed datasets, which seems to lose the uncertainty from the data-generating imputation process, and (2) my need to use a wider class of models than Zelig makes available, such as those fitted with MCMCglmm.
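To make concrete what I mean by averaging: as I understand it, the usual pooling (Rubin's rules) averages the per-dataset estimates and adds a between-imputation term to the pooled variance. Below is a minimal sketch in Python, using made-up estimates and variances for five imputed datasets. The function name `pool_rubin` and all of the numbers are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical per-dataset slopes and their squared standard errors, m = 5
estimates = [0.50, 0.55, 0.45, 0.60, 0.40]
variances = [0.010, 0.012, 0.011, 0.009, 0.010]

q_bar, total = pool_rubin(estimates, variances)
print(q_bar, total)
```

The pooled variance is larger than the average within-dataset variance whenever the estimates differ across imputations, so my discomfort is really about whether that single between-imputation term is enough.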
NOTE
I realise that MCMCglmm can handle values missing at random (MAR) in outcome variables, but where many columns have missing values, the resulting multivariate outcome model will often become overly complex.
THE QUESTION
To avoid averaging, if multiple datasets were generated (assume sensibly) through a multiple imputation algorithm (say using the Amelia package), would it make any sense to combine the datasets (e.g. using rbind) with an indicator for each of the imputed datasets, and then to model each specific imputed dataset as a random effect in, say, MCMCglmm?
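The stacking step I have in mind can be sketched as follows. This is only an illustration in Python with toy data, not Amelia output. The column names `id`, `x`, `y` and `imp`, and the helper `stack_imputations`, are hypothetical.

```python
def stack_imputations(imputed_sets):
    """Row-bind imputed datasets, tagging each row with its dataset index."""
    stacked = []
    for imp_id, rows in enumerate(imputed_sets, start=1):
        for row in rows:
            stacked.append({**row, "imp": imp_id})
    return stacked

# Hypothetical toy data: two individuals, with x missing for individual 2,
# so only that value differs between the imputed copies.
imputed_sets = [
    [{"id": 1, "x": 2.0, "y": 1.1}, {"id": 2, "x": 3.1, "y": 0.7}],
    [{"id": 1, "x": 2.0, "y": 1.1}, {"id": 2, "x": 2.6, "y": 0.7}],
    [{"id": 1, "x": 2.0, "y": 1.1}, {"id": 2, "x": 3.4, "y": 0.7}],
]

stacked = stack_imputations(imputed_sets)
for row in stacked:
    print(row)
```

The `imp` column is the indicator that would then enter the model as a grouping factor alongside `id`.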
REASONING
If the observations from the datasets were conceived as measurements on individuals (also included as an effect modelled as random), then conceptually it seems you would be adjusting your expectation for the variation of multiple observations within individuals from the multiply imputed datasets. Where there is no imputation, the observed values remain constant, and part of me thinks this constancy of observations within individuals shouldn’t affect the estimates... I think?
SNAG
On the other hand, just combining datasets with an indicator for each dataset would artificially (and often dramatically) increase the number of observations, which might not be handled adequately by the G/R structures.
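A small numerical sketch of the worry, in Python with made-up values: if a fully observed outcome is repeated unchanged in each of m stacked copies and every row is treated as independent, the naive standard error of the mean shrinks by a factor of roughly sqrt(m). The data and the helper `naive_se` are hypothetical, and this ignores whatever a random effect for individual would absorb.

```python
from math import sqrt
from statistics import stdev

def naive_se(values):
    """Standard error of the mean, treating every row as independent."""
    return stdev(values) / sqrt(len(values))

# Hypothetical fully observed outcome for 8 individuals
y = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.0]
m = 5                  # number of imputed datasets
y_stacked = y * m      # observed values repeat unchanged in every copy

se_single = naive_se(y)
se_stacked = naive_se(y_stacked)
ratio = se_single / se_stacked
print(se_single, se_stacked, ratio)
```

Whether the random effects for individual and for dataset fully correct this inflation is exactly what I am unsure about.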
APOLOGY
I apologise if this question makes little sense, or if the answer is just plain obvious. I’d intended to ask a statistician at work, and to simulate some data with him, but the topic came up here, and I figured others might benefit, in case others had the same (potentially naive) thought, and the experts have a quick answer, even if the answer is “you are muddled.”
Cheers,
Joseph
> On 2/03/2015, at 2:29 pm, David Duffy <David.Duffy at qimr.edu.au> wrote:
>
> On Mon, 2 Mar 2015, Bonnie Dixon wrote:
>
>> I don't think the model I am working on is a good candidate for structural
>> equation modeling because the data set is very unbalanced (ie. there are
>> very different numbers of observations for different people, taken at
>> different times), the main relationship of interest involves a time-varying
>> predictor, and one of the variables with missing data is not continuous (it
>> is a binary, categorical variable). So, I will stick with the multiple
>> imputation approach for handling the missing data.
>
> As Wolfgang mentioned, OpenMX can fit a FIML analysis to irregular data. If you were, for example, interested in a profile likelihood around a variance component, that might be the way to go. It seems to me that multiple imputation might not always respect complicated clustering/correlation, depending on the actual method. A quick search found some cautionary tales in:
>
> http://www.bmj.com/content/338/bmj.b2393.extract
>
> Just another 2c, David.
>
>
> | David Duffy (MBBS PhD)
> | email: David.Duffy at qimrberghofer.edu.au ph: INT+61+7+3362-0217 fax: -0101
> | Genetic Epidemiology, QIMR Berghofer Institute of Medical Research
> | 300 Herston Rd, Brisbane, Queensland 4006, Australia GPG 4D0B994A
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models