[R-sig-ME] missing data + explanatory variables : important complement.

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Thu Mar 25 21:55:14 CET 2010


Some additions to what I wrote yesterday: I had misunderstood the aim of
the paper as explained in its abstract, and started too far along ...

On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
> Dear list,
> 
> I have two problems when I try to use mixed models. First, as far as I know,
> there are two main implementations of mixed models: lme4 and MCMCglmm.
> 
> I am trying to model a binary response variable over a small period of time. The
> problem is that for some lines, the response is missing. In the archives of
> this mailing list, I did not find an answer to this question.
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q4/002940.html
> proposes MCMCglmm, but I checked the package and missing data are not
> handled.
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q3/002579.html, no
> solution
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q1/001794.html
> proposes the EM algorithm to solve the problem.
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q3/001188.html says
> missing data in covariates can be handled directly by R or SAS.
> 
> As I'm a beginner in the use of mixed models, I have spent a lot of time
> reading on this topic. In Edward Frees's book 'Longitudinal and Panel
> Data', estimation seems to be available with missing data. And there is an
> article promoting the use of mixed models for exactly that feature:
> 'A Comparison of the General Linear Mixed Model and Repeated Measures ANOVA
> Using a Dataset with Multiple Missing Data Points'.
> 
> So I do not understand why we can't have missing data.

Okay. I have now been able to lay my eyes on this article. The point it
makes is that the "mixed model" approach to analysing repeated
measurements on the same subject lets you use all of a subject's
observations that have complete data, discarding only the *observations*
with missing values, whereas the old "repeated measures ANOVA" would lead
you to ignore *all* observations made on a *subject* having even *one*
observation with missing data.
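To make the contrast concrete, here is a minimal base-R sketch (the toy
dataset and variable names are mine, not from the article). lme4's
(g)lmer() drops incomplete rows in the same row-wise way, via its default
na.action = na.omit; plain lm() is used here only to keep the example
self-contained:

```r
## Toy long-format dataset: 3 subjects, 4 time points each,
## with one missing response for subject 2.
d <- data.frame(
  subject = factor(rep(1:3, each = 4)),
  time    = rep(1:4, times = 3),
  y       = c(5.1, 5.3, 5.6, 5.9,
              4.8, NA,  5.2, 5.5,
              6.0, 6.2, 6.4, 6.7)
)

## "Repeated measures ANOVA" style: drop every subject with any NA.
complete_subjects <- d[!d$subject %in% d$subject[is.na(d$y)], ]
nrow(complete_subjects)   # 8 rows: subject 2 is lost entirely

## "Mixed model" style: drop only the incomplete observations.
complete_obs <- na.omit(d)
nrow(complete_obs)        # 11 rows: only the single NA row is lost

fit <- lm(y ~ time + subject, data = complete_obs)
```

With several time points per subject and scattered missing responses, the
difference between losing 1 row and losing whole subjects adds up quickly.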

This is due to the fact that what the article calls the "repeated
measures ANOVA" algorithms were (geometrically very clever) *manually*
computable simplifications of a more general procedure involving, in the
general case, manually intractable computations (multiple matrix
inversions of not-so-small dimensions); these simplifications (and the
resulting algorithms) were valid only for at least partially *balanced*
datasets.

(g)lmer() (and its predecessor lme()) of course allows this use of
data on incompletely documented *subjects*. This is so obvious to users
of "modern" software (i.e. software that is more than a transcription of
manual computation algorithms) that it is no longer mentioned as an
issue. The situation is similar to two-way fixed-effects ANOVA, where
the "manual" algorithm I was taught ... too long a time ago ...
*demanded* a balanced dataset, a requirement that had ended even with
BMDP (end of the 70s, IIRC). Similarly, nowhere that I'm aware of do the
authors of lm() mention that multiway models do not have to be balanced:
that is taken for granted ...
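For instance, here is a base-R sketch of an unbalanced two-way layout that
no "manual" ANOVA recipe covers, but that lm() fits without comment (the
data are invented for illustration):

```r
## Unbalanced 2x2 layout: cell counts 3, 2, 2 and 1.
## No manually tractable algorithm applies, but lm() does not care.
set.seed(1)
d <- data.frame(
  a = factor(c("lo", "lo", "lo", "lo", "lo", "hi", "hi", "hi")),
  b = factor(c("x",  "x",  "x",  "y",  "y",  "x",  "x",  "y")),
  y = rnorm(8, mean = 10)
)
table(d$a, d$b)               # unequal cell counts

fit <- lm(y ~ a * b, data = d)
anova(fit)                    # sequential (Type I) sums of squares
```

(One caveat: with unbalanced data the sequential sums of squares depend on
the order of the terms; packages such as car provide Type II/III tables if
that matters for the question at hand.)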

However, what the authors of that paper of yours do *not* mention is
that analyzing this incomplete dataset by using all the complete
*observations*, while better than using only complete *subjects*, might
still lead to biased estimates, and that debiasing them should involve
multiple imputation of the missing data in incomplete *observations*.
That subject was thoroughly explored by Rubin (since the 80s, IIRC), and
involves the specialized packages I mentioned yesterday ... or turning
to Bayesian estimation, which is a horse of an entirely different color.
Look up the 2nd edition of Rubin's book (ca 1998, IIRC) on the subject
for (much) better explanations.

To summarize:

Repeated-measures ANOVA: manually tractable, needs balanced datasets,
and therefore forces you to ignore incomplete *subjects*. Obsolete (but
clever and historically important).

Mixed-model ANOVA: manually intractable, accepts unbalanced datasets
but does not allow for partial observations, and therefore forces you to
ignore incomplete *observations*. The modern solution, but it does not
account for possible bias due to missing data.

===== Your paper stops here (and I started here yesterday) =====

Multiple imputation + mixed-model ANOVA: lets you use all available
information, estimates the loss of information incurred by missing data
and attempts to make up for it. The best frequentist (classical)
solution; needs specialized software, and might require special modeling
effort for the missing-data mechanism if "missing at random" is not
"obviously" reasonable.

Bayesian modeling: requires serious effort to model the phenomenon of
interest, its covariates, possibly the missing-data mechanism, and the
a priori information; needs some awareness of the computational
difficulties (not always solvable); and its current tools are not yet
perfected. But it is (theoretically) the best possible solution, since
it attempts to model the joint distribution of all the data, including
their missingness. The answers it leads to (distributions, credible
intervals, Bayes factors, probabilities) have intuitive meanings, quite
different from "frequentist" confidence intervals and p-values, and
might not (yet) be accepted in some circles insisting, for example, on
hypothesis testing.
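To give a feel for what the multiple-imputation route involves, here is a
deliberately naive base-R sketch of the impute / fit / pool cycle with
Rubin's rules. A real analysis would use a dedicated package (mice,
Amelia, ...) and a defensible imputation model; the "resample observed
values plus noise" imputation below, and all the data, are invented purely
to make the pooling mechanics visible:

```r
## Simulate a regression with some covariate values knocked out.
set.seed(42)
n <- 40
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)
x[sample(n, 8)] <- NA

m <- 20                                  # number of imputations
fits <- lapply(seq_len(m), function(i) {
  xi   <- x
  miss <- is.na(xi)
  ## naive stochastic imputation (placeholder for a real model):
  xi[miss] <- sample(xi[!miss], sum(miss), replace = TRUE) +
              rnorm(sum(miss), sd = 0.1)
  fit <- lm(y ~ xi)
  c(est = coef(fit)[["xi"]], var = vcov(fit)["xi", "xi"])
})
est <- sapply(fits, `[`, "est")
u   <- sapply(fits, `[`, "var")

## Rubin's rules: pool the estimate; total variance combines the
## within-imputation variance and the between-imputation variance.
qbar      <- mean(est)
ubar      <- mean(u)
b         <- var(est)
total_var <- ubar + (1 + 1 / m) * b
c(estimate = qbar, se = sqrt(total_var))
```

The between-imputation term b is precisely the "loss of information
incurred by missing data" mentioned above: it inflates the pooled standard
error instead of pretending the imputed values were observed.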

HTH,

						Emmanuel Charpentier

PS: since "manual" algorithms have been out of practical use since the
end of the 70s and the inception of what were then called "personal
computers", I'm a bit surprised that a paper published in 2004 still
invokes that issue... Is your domain special (or especially
conservative)?



