[R-sig-ME] missing data + explanatory variables

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Thu Apr 1 00:04:11 CEST 2010

Dear list, dear Christophe,

On Tuesday 30 March 2010 at 10:57 +0200, christophe dutang wrote:
> Dear list,
> Emmanuel, thank you for your long answer.
> See below,

Fair warning : this is another long response, probably useless and
tedious to most of the list's members. But I remember researching these
kinds of questions, and I have tried to answer the way I would have
liked to be answered back then...

Christophe, if you have further questions, please feel free to mail me at
ChaRpEnT [at] BAcbUc <dot> DYNdns <dot> orG (all lowercase of course :
the brokenCamelBackCapItalizaTion being a feeble attempt at foiling
spammers' spiders...).

I take the liberty to [ Snip... ] what has already been said...

> Just to be sure to understand what you are talking about. Repeated-measures
> ANOVA are modelled by
> Y_it = mu + alpha_i + e_it ?

That is correct for a one-way, UNrepeated-measures ANOVA. But it is much
better to use vector/matrix notation and save the indices for denoting
the various variables.

Furthermore, the distinction between class variables (qualitative data)
and quantitative variables is somewhat artificial : a class variable with
n levels is (behind the scenes) replaced with n-1 semi-quantitative
variables (0 or 1), the exact transformation depending on the choice of
contrasts.

Please see MASS4 for further (much needed) explanations on this subject.
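A quick illustration of this dummy-coding, using base R only (a sketch,
not anything from MASS4 itself) : model.matrix() shows the n-1 columns a
3-level factor turns into under the default (treatment) contrasts.

```r
# A 3-level factor is expanded behind the scenes into n - 1 = 2
# semi-quantitative (0/1) columns; the exact coding depends on the
# chosen contrasts (treatment contrasts by default in R).
f <- factor(c("a", "b", "c", "a"))
X <- model.matrix(~ f)
print(X)        # columns: (Intercept), fb, fc
# Level "a" is absorbed into the intercept; "fb" and "fc" are 0/1
# indicators of the two remaining levels.
print(ncol(X))  # 3 = intercept + (3 - 1) dummy columns
```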

> > Mixed-model ANOVA : manually intractable, accepts unbalanced datasets but
> > does not allow for partial observations, therefore forces you to ignore
> > incomplete *observations*. Modern solution, but does not accounts for
> > possible bias due to *missing* *data*.
> >
> Y_it = Z_it . alpha_i + X_it . beta + e_it

That would be one way to denote a mixed model, the point being that both
e_it and the Z_it . alpha_i term are considered sources of randomness
(i. e. the "Z" effects are *not* directly reproducible, their values are
not of direct interest, but they are useful in partially explaining away
part of the variability of Y (= Z.alpha + e), reducing the estimate of
the residual variance to e, and thus allowing for more powerful
comparisons on X).
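A minimal sketch of this Y = Z.alpha + X.beta + e decomposition in R,
using nlme's lme() and its built-in Orthodont data (my choice of example
data and function, not anything from the original thread) : Subject plays
the random ("Z") role, age the fixed ("X") role.

```r
# Fit a random-intercept model: distance explained by a fixed effect of
# age plus a random, Subject-specific intercept plus residual error.
library(nlme)
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)
# The "natural" variability is now split between the between-Subject
# part (the random intercept) and the pure residual e:
print(VarCorr(fit))
# The beta estimates, freed of the Subject-to-Subject noise:
print(fixef(fit))
```

Comparing VarCorr(fit) with the residual variance of the corresponding
lm() fit shows concretely how part of the variability has been
"explained away" by the random term.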

You'd better read the lmer vignettes and, even better, both Pinheiro &
Bates (2000) and the draft chapters of Bates (201x) that Douglas Bates
offered for comment a short while ago... A full explanation of this
subject is *long* and involves difficult concepts, both probabilistic and
geometrical (linear algebra being of big help in bridging the two).

Furthermore, I'm not a tenured statistics professor (I'm a dental
surgeon :-) among other things...).

> Is that correct? by the way, why we call them ANOVA models? (in my
> scolarship, anova was introduced for anova table.)

ANOVA (ANalysis Of VAriance), ANCOVA (ANalysis Of COVariance) and other
*algorithms'* names are a bit of a leftover from the times of manual
computation, when very different algorithms were necessary to treat each
special case of decomposing the total variance into "factor" and
"residual" parts. See for example Winer (1957, IIRC) if you want a nice
example of what it meant in those days to (literally) analyze variance
in various models, and have pleasant nightmares...

Nowadays, modern software allows you to think in the overall framework
(the linear model) and does the button-sorting for you. lm() and lmer()
(and, by the way, lme()) do this kind of sorting, the second allowing you
to decompose the "natural" variability according to more than one source
of variability.

You should really have at least a peek at MASS4 and the references it
points to for better explanations.

On the other hand, the "anova table" concept is, indeed, a byproduct of
the anova algorithm, but it is also a way to summarize the result of a
linear model analysis, by breaking the total variability of Y down into
"interesting" sources (the factor(s)) and the "residual" variability
(resulting from "natural" variability).

[ Digression : why this "anova table" allows you to "test" your factor
effects, and why this table is no longer deemed the necessary endpoint
of a linear model analysis ]

This breakup allows you to compare the between-factor-levels and the
"residual" variability : the formal ANOVA test (Fisher's test, alias
F-test) I was taught compares the residual variability to another
estimate of the natural variability, indirectly derived from the
differences between group means. If the average difference between
group means is zero (that is your null hypothesis), the ratio of these
two quantities will "naturally" fluctuate about 1 (following an F
density) ; if the latter estimate is much greater than the residual
variability (i. e. the ratio is "significantly" greater than 1), you
can reject the null hypothesis.
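The ratio just described can be checked by hand on a toy dataset (my own
base-R sketch, with made-up data) : the F value anova() reports is
exactly the between-group mean square over the residual mean square.

```r
# One-way layout: 3 groups of 10 observations, null hypothesis true.
set.seed(1)
g <- gl(3, 10)
y <- rnorm(30)
fit <- lm(y ~ g)
tab <- anova(fit)

# "By hand": between-group variability vs residual variability.
gm  <- tapply(y, g, mean)              # group means
ssb <- sum(10 * (gm - mean(y))^2)      # between-group sum of squares
msb <- ssb / (3 - 1)                   # its mean square (df = 2)
mse <- sum(resid(fit)^2) / (30 - 3)    # residual mean square (df = 27)

# Same ratio as the anova table's F value:
print(all.equal(unname(tab$`F value`[1]), msb / mse))
```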

It can be shown that this test is related to what are known as "Wald"
tests (roughly : the ratio of an effect estimate to the "natural"
variability estimate under the null hypothesis of no effect, leading to
rejection of the null hypothesis if the ratio is "too different from
0"), whose easiest, best-known simple case is Student's t test.

Other test families do exist, based on assessing the shape of the
likelihood of the sample as expressed by your model ; the score test
is based on the size (the norm) of the likelihood's first derivative at
the point representing the null hypothesis ; the likelihood ratio test
(which is "optimal" in various cases, according to the dreaded
Neyman-Pearson lemma) is based on the ratio of the likelihoods of the
sample computed according to two *nested* models, the second (simpler)
one representing the "null hypothesis" imposed on the first.
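A likelihood ratio test of two nested models takes only a few lines in R
(my own illustrative sketch, with simulated data) : the simpler model
encodes the null hypothesis, and twice the log of the likelihood ratio
is referred to a chi-squared distribution.

```r
# Nested linear models: the null model drops the effect of x.
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit0 <- lm(y ~ 1)   # null hypothesis: no effect of x
fit1 <- lm(y ~ x)   # alternative: x has an effect

# -2 * log(likelihood ratio), compared to chi-squared with df = 1
# (one parameter difference between the nested models):
lrt <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
print(pchisq(lrt, df = 1, lower.tail = FALSE))  # small p => reject null
```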

This idea of "model comparison" is much more general than the
special-case "analysis of variance" idea, and can be applied in very
many situations ; it is therefore a much more important thing to
understand than ANOVA technicalities (but please understand that these
"technicalities" are nothing to sneer at, and that Douglas
Bates' (enormous) work is key to being able to apply those "simple"
ideas to complicated situations).

Reference : any good and recent mathematical statistics handbook. My
personal references (Kendall & Stuart 1957, Cox & Hinkley 1974) are
neither fun nor very pedagogic, and presuppose a non-negligible deal of
probability theory. Williams (2000 ?) is far from being a bad book, but
I didn't really peruse it and will refrain from expressing a "hard"
opinion on it.

See also MASS4 and Bill Venables' "Exegeses on the linear
model" (available on Brian Ripley's Web page) on what model comparisons
are legitimate (= meaningful).

I will refrain from commenting further on the issue of the "necessity"
of hypothesis testing, which is a horse of another color (and very
different feathers :-)...

[ End of digression. ]

[ and another Snip ... ]

> Do you have any good reference on this topic?

That's a hard one ! So far, I've seen various introductions to the
subject, and most of them presuppose a reader already familiar with
"classical" procedures (and often a working knowledge of linear algebra
and not-so-trivial calculus).

Williams (2000 ?) is a general introduction that gives a good deal of
explanation of Bayesian thinking, but is terse on practical issues.
Gelman et al. (2004) is a very good textbook on Bayesian data analysis ;
it supposes some knowledge of probability theory, works up long
non-trivial examples using mostly direct R programming (and gives very
interesting, sometimes downright intriguing exercises), but skims over
practical tools, relying mostly on R. Albert (2007) is a much shorter
book, quite readable, which introduces important concepts very well but
does not delve into deep theoretical development. Like Gelman et al.
(2004), it relies almost exclusively on direct R computation.

On the other hand, Gelman & Hill (2007 ?) is a *very* good *modern* book
on regression, which introduces fixed-effects linear models, then
mixed-effects models, then Bayesian modeling for regression, using both
R and BUGS. I *strongly* recommend reading this book if you have
anything to do with regression, but be prepared to dive into Gelman et
al. (2004) if you want further explanation of Bayesian thinking. Be also
prepared to read Frank Harrell's (2002) book, not only as an antidote,
but also to grasp important concepts in regression modeling.

But none of these books will give you what Spiegelhalter's WinBUGS
examples volumes (now 3 of them, plus 2 for the old command-line BUGS
examples) may easily convey : the extraordinary flexibility of BUGS as a
modeling language. These examples are a sufficient reason to download
WinBUGS and start playing with it. I do not much appreciate the
graphical interface, but the facility to interact with a model and
progressively enhance it has few equivalents (R + Emacs + ESS + rjags +
JAGS, perhaps ... and you still need Spiegelhalter's examples !).

The report Spiegelhalter & his colleagues prepared in 2002 for the
British MRC should also be required reading if your field has anything
to do with biostatistics : it gives you a very good view of *what*
Bayesian analysis can do in this field (while being quite terse on *how*
to do it...). ISTR that this report was later (2004 ?) expanded into a
book.

>                                                a good implementation of this
> approach is the MCMCpack?

No. MCMCpack does indeed model the dependent variable and gives you
Bayesian estimates of your parameters (and does this efficiently), and
allows for direct (= Bayes factor) model comparison, but it does not
model the covariates, which is the key point if you have to impute them.
As far as I can tell, it also eliminates data points with missing
dependent variables (for compatibility with lm() ?). Last time I looked,
it did not have general-purpose multilevel (= mixed-model) functions.

However, MCMCpack implements interesting models for the social sciences,
which might be useful if they are also relevant to your field.

As far as I know, the only multiple imputation package that uses
Bayesian methods is mi (it's probably not by chance that this package is
from Gelman and his colleagues...). It is tempting ... But (again !) I
did not (yet) explore it sufficiently to give an *enlightened* opinion.

Trying to "roll your own" imputation and analysis model(s) in BUGS would
probably be difficult (and *quite* time consuming), but it *certainly*
would be educational... and fun !

> > HTH,
> >
> > Emmanuel Charpentier
> >
> > PS : since "manual" algorithms are out of practical use since the end of
> > the 70s and the inception of what was then called "personal computers", I'm
> > a bit surprised that a paper published in 2004 still invokes that issue...
> > Is your domain special (or especially conservative) ?
> >
> I'm working in actuarial science. But I had just googled the paper and have
> no particular link to that domain.

Aha ! Then you might try to buttonhole Vincent Goulet, who is
(associate ?) professor of Actuarial Science in Canada. He is most
certainly better than I am at practical issues (he is part of the gang
that maintains the 64-bit Debian and Ubuntu ports of R-CRAN, which is
quite a feat...), and indisputably knows infinitely more than I do about
actuarial statistics.


					Emmanuel Charpentier
