[R-sig-ME] missing data + explanatory variables

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Thu Mar 25 00:58:44 CET 2010


I have been fighting (part of) these questions, too. Below are some
partial and provisional answers.

On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
> Dear list,
> 
> I have two problems when I try to use mixed models. First as far as I know,
> there are two main implementations of mixed models: lme4 and MCMCglmm.
> 
> I try to model a binary response variable over a small period of time. The
> problem is that for some lines, the response is missing. In this mailing
> list archive, I do not find response to this question.
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q4/002940.html proposes
> the MCMCglmm but I check the package and missing data are not
> handled
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q3/002579.html, no
> solution
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q1/001794.html proposes
> the EM algorithm to solve the problem
> 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q3/001188.html says
> missing data in covariates can be handled directly by R or SAS.
> 
> As I'm a beginner in the use of mixed models, I spent a lot of times reading
> on this topic. And in the book of Edward Frees 'longitudinal and panel
> data', estimation seems to be available for missing data? And there is an
> article promoting the use of mixed model for that feature:
> 'A Comparison of the General Linear Mixed Model and Repeated Measures ANOVA
> Using a Dataset with Multiple Missing Data Points'

I'd be interested in references for this paper. Google led me to :

@article{krueger_tian_2004, title={A Comparison of the General Linear
Mixed Model and Repeated Measures ANOVA Using a Dataset with Multiple
Missing Data Points}, volume={6}, DOI={10.1177/1099800404267682},
number={2}, journal={Biol Res Nurs}, author={Krueger, Charlene and Tian,
Lili}, year={2004}, month={Oct}, pages={151-157}} ,

and it does not seem to be available in any of my country's (France)
academic libraries : the closest sources I can get are in Germany or
UK...

I'm a bit surprised by the abstract note (edited out above). Any model
can (theoretically) be used to impute missing data; the theoretical work
of Rubin and followers has shown that a) this is desirable, since
discarding incomplete observations ("complete-cases analysis") leads in
many cases to biased estimates, b) such imputation can be done
"semi-automatically" in two special cases ("missing completely at
random" and "missing at random" data) by specifying a distribution for
the variables with missing data (in other cases, one must *also* model
the missing data mechanism), c) such imputation should be done at least
a few times in order to estimate the excess variability that the
imputation procedure adds to the estimators, and d) such multiple
imputations should be combined in a way that incorporates this excess
variability.
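
Points c) and d) are usually summarised as "Rubin's rules". A minimal
base-R sketch for a single scalar parameter follows; the function name
pool_rubin and the numbers fed to it are made up for illustration and do
not come from any package or dataset.

```r
# Pool m estimates of one scalar parameter using Rubin's rules.
# `estimates` and `variances` are hypothetical inputs: the point estimate
# and its squared standard error obtained from each of the m imputed datasets.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m >= 2, length(variances) == m)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance, including the excess
  list(estimate = q_bar, within = u_bar, between = b,
       total = total, se = sqrt(total))
}

# Made-up numbers, for illustration only
est <- c(1.0, 1.2, 0.9, 1.1, 0.8)
v   <- c(0.040, 0.050, 0.045, 0.040, 0.050)
pooled <- pool_rubin(est, v)
pooled$estimate
pooled$se
```

The (1 + 1/m) * b term is the "excess variability" mentioned above: it
vanishes only if all imputed datasets give the same estimate.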

Mixed models explicitly model part of the inter-individual variation, by
splitting it between between-groups and within-group variabilities.
Therefore, imputation might be more precise. But they do not, *by
themselves*, allow for imputation.

Bayesian estimation of mixed model parameters, which is becoming popular
thanks to the BUGS language and recent textbooks such as Gelman & Hill
(2007) (recommended reading, even if you are not bound to adhere to all
of its conclusions...), sort-of facilitates this imputation process, but
only in the sense that it is (relatively) easy to specify (at least in
BUGS) a (reasonable) a priori distribution for any variable (though it
is not always easy to obtain and assess numerical convergence...).

Other solutions have been proposed. At least three packages aim at
proposing a "reasonable" imputation model for missing data: AmeliaII
(which I did not fully explore, since it is a bit "closed"), mice (which
is quite "open", but whose current version (2.3) has some problems in
specifying specialized imputation functions) and mi (Gelman & his gang,
close to the ideas of the textbook mentioned above), which I have not
yet explored fully; it seems interesting, but a bit awkward if you need
to use specialized imputation functions.

Various packages allow for estimation from a multiply-imputed dataset:
mice and mi, of course, but also mitools and (reportedly) Zelig. The
only one that tries to implement hypothesis testing (e.g. a test for the
"significance" of a whole factor) is mice. But I'm less and less
convinced of the necessity of such a procedure, notwithstanding journal
editors' wishes, whims and tantrums (see below).
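
For orientation, the impute / analyse / pool cycle can be sketched with
mice as follows. This is only a sketch: it assumes mice is installed,
uses the small nhanes example dataset shipped with the package, and fits
an ordinary lm() rather than a mixed model to keep things short. The
seed and the number of imputations are arbitrary choices.

```r
library(mice)  # assumed to be installed

# nhanes is a small example dataset shipped with mice; it contains NAs
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # 5 imputed datasets

# Fit the same analysis model on each imputed dataset
fits <- with(imp, lm(bmi ~ age + chl))

# Combine the 5 sets of estimates, accounting for between-imputation variance
pooled <- pool(fits)
summary(pooled)
```

The default imputation methods chosen by mice() are not necessarily
appropriate for clustered or longitudinal data, so the imputation model
deserves as much scrutiny as the analysis model.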

Bayesian estimation "automatically" incorporates the added uncertainty
due to missing data, since it uses a full probability model for imputing
them (it can also incorporate relevant prior information if available,
which could be extremely valuable but might invalidate your imputations
from a "strict frequentist" point of view). But it can do so only if it
is built for using and modeling all data. Some specialized software,
such as MCMCpack, is built on the assumption of a complete dataset and
of a predefined shape for the a priori distributions of the variables
(quite often a so-called 'uninformative' distribution). If you do have
missing covariates, it won't model them and will exclude the relevant
observations, thus leading to the same problems that plague
"complete-cases analysis".

Similarly, lmer, as far as I can tell, does assume some shape for the
distribution of the group-level coefficients and for the distribution of
the dependent variable given its predictors, but won't impute missing
data. Ditto, as far as I can tell, for MCMCglmm.
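
The silent fall-back to complete cases is easy to see with base R alone.
The sketch below uses simulated data (all names and numbers are made up)
and an ordinary glm(); R's default na.action, na.omit, drops incomplete
rows when the model frame is built, and model-fitting functions that rely
on the same convention behave alike.

```r
set.seed(42)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(-0.5 + d$x))  # binary response
d$x[sample(n, 40)] <- NA  # 40 covariate values set to missing, at random

fit <- glm(y ~ x, family = binomial, data = d)

nrow(d)                 # number of rows supplied
nobs(fit)               # number of rows actually used in the fit
length(fit$na.action)   # number of rows dropped because of NAs
```

Here the values are missing completely at random, so the complete-case
fit mostly loses precision; when missingness depends on other variables,
the dropped rows can also bias the estimates.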


> So I do not understand why we can't have missing data?

Because there is no probability model for imputation (nor a procedure
for combining estimates) in (g)lmer or in MCMCglmm.

> Secondly, the presentation of D. Bates done at Max Planck Institute in 2009
> states that p-values are not available for mixed models because the
> distribution of parameter estimators are not known.
> 
> My question is how can we know that an explanatory variable is significant?
> is the only tool to fit another model without the variable and to use the
> anova function?

Because 1) you did not specify *what* a "significant variable" is :-),
and 2) the exact distributions of the possible "test statistics" (Wald?
Score? Likelihood ratio? Ad-hoc relevant statistic?) are not known, at
least in the general case. The simplified models that were proposed
about 50 years ago for the (very) special case of *balanced* datasets
resulting from *designed* experiments (implemented in the aov()
function) do not hold for the ((much) more) general case (g)lmer aims to
implement.
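
For concreteness, the "fit with and without the variable, then call
anova()" approach from the question can be sketched as below. It assumes
lme4 is installed and uses the cbpp example data shipped with it (a
binomial response, fixed effect period, random intercept per herd). The
p-value relies on an asymptotic chi-square reference distribution, which
is exactly the approximation whose quality is in question here, so treat
it as a rough guide rather than an exact test.

```r
library(lme4)  # assumed to be installed

# cbpp ships with lme4: `incidence` cases among `size` animals,
# recorded per herd and period
gm0 <- glmer(cbind(incidence, size - incidence) ~ 1 + (1 | herd),
             family = binomial, data = cbpp)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             family = binomial, data = cbpp)

anova(gm0, gm1)  # likelihood-ratio comparison of the nested models

# The same statistic by hand
ll0 <- logLik(gm0)
ll1 <- logLik(gm1)
lrt <- 2 * (as.numeric(ll1) - as.numeric(ll0))  # likelihood-ratio statistic
df  <- attr(ll1, "df") - attr(ll0, "df")        # difference in parameter count
p_approx <- pchisq(lrt, df = df, lower.tail = FALSE)  # asymptotic only
```

With few groups or small clusters, the chi-square approximation can be
poor, which is one reason simulation-based alternatives are discussed.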

Look in the R-help archives for a long discussion of the problem of such
hypothesis testing. Douglas Bates stated (rightly, IMNSHO) that
reproducing "what SAS does" was *not*, to his eyes, a good enough reason
to implement it, and explained (some of) his misgivings. See also his
book (Pinheiro & Bates (2000)), which gives good examples of the problem
(met with nlme, predecessor to lme4).

The proposed solution is to use MCMC sampling from the distribution of
the parameters of the model fitted by (g)lmer, and to use this sample as
a basis for taking such decisions (which is what hypothesis testing aims
to do).

The fly in the ointment is that, as far as I can tell, the relevant
functions are currently *broken* in current "stable" and "development"
versions of lme4 (and not yet written for some non-Gaussian cases). This
has been discussed on this list.

But the crux of the matter is that the second, technical point might
not be as important as the first: barrels of ink were spent discussing
the *epistemological* status of "significance" in hypothesis testing,
which became "standard operating procedure" probably for reasons having
little relevance to sound epistemology. Nowadays, we use electrons, but
the pendulum seems to be swinging in the other direction: confidence
interval estimation is now often regarded as a better indication of the
importance of your findings than a "p-value".

A lot more could be written about the use and misuse of hypothesis
testing, and much more still about the (possible) relevance of Bayesian
analysis of multilevel models and its interpretation, but "this is
another story" and I've probably gone on too long already... A look at
the relevant literature should keep you amused, sometimes bored, but
anyway busy for quite a bit of time :-).

HTH,

					Emmanuel Charpentier, DDS, MSc

> Thanks in advance
> 
> Christophe
> 



