[R-sig-ME] Teaching Mixed Effects

Sat Jan 24 02:50:53 CET 2009

Douglas, your extensive illuminating commentary, added
to questions and comment from list members, makes this
a marvelous list to read and learn from!  It would be good
to get such lists started for other parts of R.  (or are there
other comparable lists, with running highly expert
commentary laid on?)

Whether testing for random effects is legitimate depends
on the purpose.  What I consider wrong-headed is to test as
a preliminary to calculating SEs and/or tests for fixed effects.
I leave aside the question of whether one should be testing
at all.

I've just now posted a revised version of overheads from a
talk I gave a few months ago on multilevel models in R.
As often, I was over-ambitious in what I tried to cover.
Later slides may however be interesting for poring over at
one's leisure.  For inclusion on my web page, they would
benefit from a bit more commentary, which I will try to add
in due course.

http://www.maths.anu.edu.au/~johnm/r-book/2edn/xtras/mlm-ohp.pdf

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 23/01/2009, at 9:40 AM, Douglas Bates wrote:

> My thanks to Andrew for starting this thread and to Ben for his
> responses.  I will add more responses below.
>
> I'm sorry that my responses have been delayed - it happens that this
> is the first week of our spring semester and I have been tied up
> getting courses underway.
>
> On Tue, Jan 20, 2009 at 8:59 AM, Bolker,Benjamin Michael <bolker at ufl.edu 
> > wrote:
>>  Some comments (others will certainly have more to add)
>> ________________________________________
>> From: r-sig-mixed-models-bounces at r-project.org [r-sig-mixed-models-bounces at r-project.org 
>> ] On Behalf Of Andrew Beckerman [a.beckerman at sheffield.ac.uk]
>> Sent: Tuesday, January 20, 2009 6:47 AM
>> To: R Models Mixed
>> Subject: [R-sig-ME] Teaching Mixed Effects
>>
>> Dear R-Mixed people -
>>
>> I am about to embark on a day of attempting to teach some aspects of
>> mixed models using R to PhD students.  I was wondering if anyone  
>> would
>> be willing to indulge in this summary below, developed through  
>> reading
>> threads on R-Mixed and R-Help over the past few months, and vet my
>> list of issues/questions/topics (4)  associated with mixed models?
>>
>> Let me reduce any rising blood pressure by saying that I understand
>> (possibly) and accept why there are no p-values in lmer, and NONE of
>> the comments/questions below are about why lmer does not produce
>> sensible df's and p-values to calculate significance (Phew).
>>
>> #######################
>>
>> First, a technical question:
>>
>> Based on these two threads:
>> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q4/001459.html
>> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q4/001456.html
>>
>> IS mcmcsamp() broken for "complicated" random effects? Is it in good
>> enough shape to teach for "simple" gaussian mixed models, to
>> demonstrate the principle?
>>
>> BMB: good question. I'd like to know, although I would
>> guess that it would be OK for demonstrating the principle.
>> You could try it out and see if it's sensible ...
>
> I have not verified the results from the current mcmcsamp, even for
> simple Gaussian models.  They seem reasonable for these models but I
> need to look at them much more closely before I could advise trusting
> those results.
>
> The problem with designing an mcmcsamp method is that the variances of
> the random effects can legitimately be zero and often have a
> non-negligible probability of assuming the value of zero during the
> MCMC iteraions.  However, most methods of sampling from the
> distribution of a variance are based on sampling from the distribution
> of a multiplier of the current value.  If the current value is zero,
> you end up stuck there.
>
>> #######################
>>
>> Now, here is what I am possibly going to talk about.....
>>
>> 0) Rule number 1 is to design experiments well, and aim for
>> orthogonal, well replicated and  balanced designs.  If you get data
>> that conforms to all of that, old school F-ratio's CAN be used.  If
>> not, see 1-4 below (we will assume that Rule number 1 will be  
>> broken).
>
>> BMB: good idea.
>
> In a designed experiment you can do this.  In an observational study
> you can't expect to end up with balanced data.  I also tell my
> students that assuming a balanced design will always produce balanced
> data is contrary to Murphy's Law.  Balance is fragile.  I think it is
> unwise to depend on balance being achieved.
>
>> 1) It is agreed that the Laplacian methods for estimating terms and
>> "likelihoods" in mixed effects models is considered most reliable
>> (accurate and precise).
>>
>> BMB:  I believe (but stand ready to be corrected) that
>> PQL vs Laplace vs adaptive Gauss-Hermite quadrature
>> (AGHQ) is an issue for GLMMs, not so much for LMMs.
>> AGHQ is generally even better (but slower) than Laplace,
>> which is a special case.
>
> Exactly.  Adaptive Gauss-Hermite quadrature uses a quadrature formula
> centered at the conditional modes of the random effects and scaled by
> an approximation to the conditional standard deviations.  (That's
> where the term "adaptive" comes from.)  The quadrature formula depends
> on the number of quadrature points.  Generally we use an odd number of
> points so one of the evaluations is at the conditional mode.  Thus you
> could use a 3-point evaluation or a 5-point evaluation.  The simplest
> formula, the 1-point Gauss-Hermite evaluation, is exactly the Laplace
> approximation.
>
> It turns out that the Laplace approximation is the only feasible such
> method (at least the only one that I can imagine) when you have more
> than one grouping factor for the random effects.  Even with just one
> grouping factor, if you have vector-valued random effects (meaning
> that you have more than one random effect associated with with each
> level of the grouping factor) then the complexity of AGHQ is the
> number of quadrature points raised to the dimension of the
> vector-valued random effect.  Thus if you have a random intercept and
> a random slope for each subject, say, and you choose a 7 point formula
> then you must do 49 evaluations of the deviance residuals for each
> AGHQ evaluation.
>
> Oliver Schaubenberger's paper on SAS PROC GLIMMIX, which Ben mentions
> below, discusses the problem of proliferation of the number of
> evaluations required by AGHQ.
>
>> R (lme4) and ADMB model builder use these
>> methods. SAS nlmixed does, but SAS proc mixed does not appear to.
>> STATA can.  Genstat does/can (see final note below**).
>>
>> BMB: SAS PROC MIXED does, as of version 9.2
>> see www2.sas.com/proceedings/forum2007/177-2007.pdf
>
> I think you mean SAS PROC GLIMMIX, not SAS PROC MIXED.
>
> Regarding the comparison of methods provided by different software,
> remember that details of the implementation can be important.  The
> term "adaptive Gauss-Hermite quadrature" describes a technical
> approach but there can be many variations on the implementation, with
> important consequences for precision, speed and accuracy.  It is a
> gross oversimplification to imagine that such a technique is
> implemented by handing a paper with some formulas to a "programmer"
> and declaring the job done.  Comparing implementations, including
> looking at the intermediate steps, is important but only possible for
> open source implementations.
>
>> 2) It is agreed that the appropriate test for fixed effects in mixed
>> models should be between nested models.  However, there is no
>> agreement as how to characterise the distributions that would be used
>> to generate p-values.  This is the crux of the Bates et al argument:
>> Likelihood Ratio Tests, Wald tests etc all need to assume a
>> distribtion and some degrees of freedom.  But, in many mixed models,
>> the distribution need not conform to any of our standard ones (t,F,
>> Chi-square etc), especially when the number of subjects in the random
>> effects is small.  Moreover, the relationship between fixed and  
>> random
>> effects means that it is nearly impossible, and perhaps not  
>> worthwhile
>> to calcuate what might be appropriate "degrees of freedom".
>
> I'm afraid my response will be rather long-winded, for which I
> apologize.  I feel that as statisticians we have done a disservice to
> those who use statistics by taking complex problems (tests of
> hypotheses for terms in complicated model structures) for which there
> is an enormous simplification in the central special case (linear
> dependence of the mean on the model parameters, "spherical" Gaussian
> distribution) and describing certain computational shortcuts that can
> be used only in this special case.  For the most part we don't even
> hint at the general problem or even discuss the approach used in the
> special case.  We jump right in to a particular form of a summary of a
> computational shortcut, the analysis of variance table.
>
> I am very pleased to see you describe the situation as a comparison of
> two nested models.  That is indeed how we should approach the problem.
> We have a general model and a specific model that is a special case
> of the general model.  Obviously the general model will fit the data
> at least as well as the more specialized model.  We wish to determine
> if the fit is sufficiently better to justify using the more general
> model.   To do so we must decide how to judge the extent to which the
> general model fits better and how much more complex it is, so we can
> form some kind of cost/benefit criterion.  Then we must assess the
> value of this test statistic by comparing it to a reference
> distribution.  In the special case of a Gaussian linear model it all
> works out beautifully and we can summarize the results very compactly
> with t statistics or F statistics and their degrees of freedom.  But
> this case is special.  It is misleading to believe that things will
> simplify like this in more general cases.
>
> Consider the question of how we measure the comparative complexity of
> two models. Typically we can measure the difference in the complexity
> of the models in terms of the number of additional parameters in the
> general model. In the central special case (Gaussian linear models)
> there is a geometric representation of the model where the general
> model corresponds to a linear subspace of the sample space and the
> specific model corresponds to a subspace contained in the general
> model.  The spherical Gaussian assumption (i.e. Gaussian distribution
> with variance-covariance of sigma^2 I, for which the contours of
> constant probability density are spheres) links the probability model
> to the Euclidean geometry of the space.  From those two assumptions
> the methods of testing a general linear hypothesis can be derived
> geometrically.  Gosset derived a statistic that is equivalent to the t
> statistic using analytic means but the current form of the t statistic
> and its relation to degrees of freedom came from Fisher and were based
> on his geometric insight (see the Wikipedia article on Gosset).  And,
> of course, Fisher was able to generalize the t distribution to the F
> distribution, again based on geometric principles.
>
> In the first chapter of our 1980 book "Nonlinear Regression Analysis
> and Its Applications" Don Watts and I illustrate the geometric
> approach to the t and F statistics for linear models.  Fisher's genius
> was to see that questions about comparative model fits, which are
> related to distances in the geometric representation, can be
> transformed into questions about angles, related to the ratios of
> distances or, equivalently, squared distances of orthogonal components
> of a response vector.  The use of the ratio allows one scale factor
> (variance component in statistical terminology) to be canceled out.
> That is, the null distribution of an F ratio can be expressed without
> needing to specify the unknown error variance.
>
> Regrettably, the "use a ratio to eliminate a variance component" trick
> only works once.  The first variance component is free but not
> subsequent ones.  If you have a perfectly balanced, orthogonal design
> then you can apply the trick multiple times by isolating certain
> orthogonal submodels and applying the trick within the submodel.  That
> is, you can use estimates from different "error strata" in ratios for
> different terms.  However, that approach based on certain mean squares
> and expected mean squares doesn't generalize well.  The
> computationally tractable approach to estimation of parameters in
> mixed models is maximum likelihood or REML estimation.
>
> The reason for my long-winded explanation is your saying " This is the
> crux of the Bates et al argument: Likelihood Ratio Tests, Wald tests
> etc all need to assume a distribution and some degrees of freedom.",
> which is a natural statement given the way that we teach analysis of
> variance.  We teach the "what" (i.e. create a table of sums of
> squares, degrees of freedom, mean squares, F ratios, p-values) and not
> the "why".  If you only see the "what" then it is natural to assume
> that there are some magical properties associated with sums of squares
> and degrees of freedom and all we need to do is to figure out which
> sums of squares and which degrees of freedom to use.  The magical
> properties are actually the simplified geometric representation
> (orthogonal linear subspaces, Euclidean geometry) that is unique to
> the Gaussian linear model.  The beauty of that model is that, no
> matter how complicated the representation of a test as a formula may
> be, the geometric representation is always the same, as the ratio of
> the normalized squared lengths of two orthogonal components of the
> response vector.
>
> When we step away from that Gaussian linear model the simplifications
> all break down.  I spent the early part of career thinking of what
> parts of the Gaussian linear model can be transferred over to the
> Gaussian nonlinear model and what that would mean for inference.  This
> is why I concentrated so much on the geometry of models based on the
> spherical Gaussian distribution.  Generalized linear models retain the
> linear predictor, transformed through an inverse link function to the
> appropriate scale for the mean, but allow for distributions other than
> the Gaussian.  They require another way of thinking.  I think of mixed
> models as being based on two random vectors, the response vector and
> the unobserved random effects.  In the case of a Gaussian linear mixed
> model the conditional distribution of the response, given the random
> effects, is a spherical Gaussian and the unconditional distribution of
> the random effects is multivariate Gaussian but our inferences require
> the marginal distribution of the response.  For a linear mixed model
> this is Gaussian but with more than one variance component.  Getting
> rid of just one variance component won't do, yet the t and F
> derivations depend strongly on just having one variance component that
> can be removed by considering a ratio.
>
> If we want to perform a hypothesis test related to a fixed-effects
> term in a mixed model (and, for the moment, I will not go into the
> question of whether statistical inferences should always be phrased as
> the result of hypothesis tests) I would claim we should start at the
> beginning, which is considering two models for the data at hand, one
> model being a special case of the other.  We need to decide how we
> measure the quality of the fit of the general model relative to the
> more specific model and how we measure the additional cost of the
> general model.  Then we need to formulate a test statistic.  If we are
> incredibly lucky, the null distribution of this test statistic will be
> well-defined (that is, it will not depend on the values of other,
> unknown parameters) and we can evaluate probabilities associated with
> it.  That does happen in the case of the Gaussian linear model.  I
> personally don't think it will be possible to possible to provide a
> general approach that isolates the effect of a fixed-effect term in a
> linear mixed model using a statistic that does not depend on the
> values of other parameters.  I would be delighted if someone can do it
> but I think there is too much that goes right in the case of the
> Gaussian linear model to expect that the same incredible
> simplifications will apply to other models.
>
> I don't feel that holy grail of inference in mixed effects models
> should be a magical formula for degrees of freedom to be associated
> with some ratio that looks like a t or an F statistic (despite the
> strongly held beliefs of those in the First Church of the
> Kenward-Roger Approximation). Certainly there has been a lot of
> statistical research related to approximating a difficult distribution
> by a more common distribution but I view this approach as belonging to
> an earlier era.  It is the approach embodied in software like SAS
> whose purpose often seems to be to evaluate difficult formulas and
> provide reams of output including every number that could possibly be
> of interest.  I think we should use the power of current and future
> computers to interact with and explore data and models for the data.
> MCMC is one way to do this.  In nonlinear regression Don and I
> advocated profiling the sum of squares function with respect to the
> values of individual parameters as another way of assessing the actual
> behavior of the model versus trying to formulate an approximation.
> I'm sure that creative people will come up with many other ways to use
> the power of computers to this end.  The point is to explore the
> actual behavior of the model/data combination, not to do just one fit,
> calculate a bunch of summary statistics, apply approximate
> distributions to get p-values and go home.
>
> If we want to generalize methods of inference we should consider the
> whole chain of reasoning that leads us to the result rather than
> concentrating only on the last step, which is "now that I have the
> value of a statistic how do I convert it to a p-value?" or, even more
> specifically, "I have calculated something that looks like a t-ratio
> so I am going to assume that its distribution is indeed a
> t-distribution which leaves me with only one question and that is on
> how many degrees of freedom".
>
> I appreciate that this is inconvenient to those applying such models
> to their data.  Philosophical discussions of the fundamentals of
> statistical inference are all well and good but when the referees on
> your paper say you have to provide a p-value for a particular term or
> tests, it is a practical matter, not an academic, theoretical debate.
> Those with sufficient energy and skill plus a stellar reputation as a
> researcher may be able to convince editors that p-values are not the
> "be all and end all" of data analysis - Reinhold Kleigl has been able
> to do this in some cases - but that is definitely not the path of
> least resistance.  The sad reality is that p-values have taken on a
> role as the coin of the realm in science that is far beyond what any
> statistician would have imagined.  (Apparently the default
> "significance level" of 5%, which is considered in some disciplines to
> be carved in stone, resulted from a casual comment by Fisher to the
> effect that he might regard an outcome that would be expected to occur
> less than, say, 1 time in 20 as "significant".)
>
> It is unhelpful of me not to provide p-values in the lmer summaries
> but I develop the software out of interest in doing it as well as I
> possibly can and not because someone assigns me a task to compute
> something.  I really don't know of a good way of assigning p-values to
> t-ratios or F-ratios so I don't.  I still report the ratio of the
> estimate divided by it standard error, and even call it a t-ratio,
> because I think it is informative.
>
>> BMB: and specifically, if you happily use the anova() method
>> to calculate LRT it will do so -- but Pinheiro and Bates 2000  
>> expressly
>> warn against the results/show that they can be "anticonservative"
>> for small sample sizes.  If you have huge sample sizes (i.e.
>> you're not a field ecologist) then LRT may be OK (I seem to remember
>> Bates using it without comment in an analysis of a (large)
>> Bangladeshi arsenic data set).
>
> I think that was data on artificial contraception use obtained as part
> of a fertility survey in Bangladesh.
>
> I feel that the likelihood ratio is a perfectly reasonable way of
> comparing two model fits where one is a special case of the other.  In
> fact, if the models have been fit by maximum likelihood, the
> likelihood ratio would, I think, be the first candidate for a test
> statistic.  The problem with likelihood ratio tests is not the
> likelihood ratio, per se -- it is converting the likelihood ratio to a
> p-value.  You need to be able to evaluate the distribution of the
> likelihood ratio under the null hypothesis.  The chi-square
> approximation to the distribution is exactly that - an approximation -
> and its validity depends on not testing at the boundary and on having
> a large sample, in some sense of the sample size.  If I were really
> interested in evaluating a p-value for the likelihood ratio I would
> probably try a parametric bootstrap to get a reference distribution.
>
>> 2.1) However, Bates et al have mentioned the restricted likelihood
>> ratio test.  There is a package in R implementing some of these tools
>> (RLRsim), but these appear to be limited to and or focused on tests  
>> of
>> random effects.
>>
>>  BMB: You can test fixed effects, apparently, but only *in  
>> combination with*
>> a test of the random effect (the null hypothesis is always a model
>> without random effects). Also limited to LMMs, and a single
>> random effect.
>>
>> 2.2) What some "other" packages do: SAS can produce wald tests and
>> LRT's if you want, and can implement the kenward-rogers adjustement.
>>
>> BMB: Kenward-Roger ! (not Rogers)
>>
>> There is some theory behind the K-R, but it is still not dealing with
>> the crux of the problem (see 2).  Genstat uses wald tests and warns
>> you that with small numbers of subjects, these are not reliable.
>> Genstat is also experimenting with HGLM by Nelder and Lee (see **)
>>
>> 2.3) "Testing" random effects is considered inappropriate (but see  
>> 2.1
>> for methods?).
>>
>> BMB: I don't think this is necessarily true. Admittedly it is a point
>> null hypothesis (variance will never be _exactly_ zero), but I
>> can certainly see cases ("does variation among species contribute
>> significantly to the overall variance observed"?) where one would
>> want to test this question.  This is a bit murky but I think the
>> distinction is often between random effects as part of an  
>> experimental
>> design (no point in testing, not interesting) and random effects
>> as observational data.
>
> Actually the MLE or the REML estimate of a variance component can
> indeed be zero.  The residual variance (i.e. the variance of the "per
> observation" noise term) is only zero for artificial data but the
> estimates of other variance components can be exactly zero.
>
> I think of such situations as an indication that I should simplify the
> model by eliminating such a term.  I have spent most of my career at
> the University of Wisconsin-Madison in a statistics department founded
> by George Box who famously said "All models are wrong; some models are
> useful."  I don't expect a model to be correct, I am only interested
> in whether the terms in the model are useful in explaining the
> observed data.
>
>> 3) Because of 2, there is the resounding argument that bayesian and  
>> or
>> simulation/bootstrapping tools be used to evalaute fixed effects.
>>
>>  BMB: I don't know about "resounding", but OK.  Probably
>> the best option.
>>
>> Current methods proposed and coded, but in various states of
>> usefulness are:
>>
>> mcmcsamp() and HPDinterval() from lme4 + baayen *.fnc's from  
>> languageR,
>> BUGS and winBugs,
>> RH Baayen's simulation tools (e.g. page 307 method)
>> Andrew Gelman and Jennifer Hill's tools (e.g. sim() method from
>> package arm)
>> Ben Bolker's suggestions in this list for glmm's (thread: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q4/001459.html)
>>
>> 3.1) These all evalaute "simple" tests of whether beta's and  
>> intercept
>> are different than 0, and are linked to the contrasts.  There is no
>> emerging method equivalent to a LRT (but see 2.1 and **Final Note
>> Below).
>>
>> BMB: I think if you want to calculate the parametric bootstrap/
>> null-hypothesis simulation of the change in deviances between
>> nested models, it's actually reasonably straightforward. See
>> examples on glmm.wikidot.com , especially
>> http://glmm.wikidot.com/basic-glmm-simulation
>> [geared toward GLMMs but even easier for LMMs]
>>
>> 4) Andrew Gelman et al also suggest AIC based methods and model
>> averaging for model inference, given constant random effects.  I  
>> think
>> their argument about AIC is that if the "likelihood" is estimated
>> well, relative differences in AIC will be constant, irrespective of
>> any adjustement made to numbers of paramters used in calculating AIC:
>> i.e. as long as the random effects structure stays the same, the
>> relative differences between nested models will not change if the
>> number of paramters is estimated consistently. These methods still do
>> not produce p-values.
>>
>> BMB: and AIC is an asymptotic method anyway, like
>> LRTs ... which means it is likely to have the same problems, but I  
>> don't
>> think anyone has evaluated them.  If you want to use the finite-size
>> corrections (AICc) then you are back in the situation of guessing
>> at residual degrees of freedom ...
>>
>> **Final Note Below - I have noticed a relative lack of discussion of
>> Nelder et al's  H-likelihood and their methods to generate a general
>> method for all heirarchical modelling (HGLM?!).  Would anybody be  
>> able
>> to comment?  A recent paper (http://www.springerlink.com/content/17p17r046lx4053r/fulltext.pdf
>> ) that is somewhat beyond my skills, indicates the use of Laplace
>> methods to estimate likelihoods in heirarchical models and various
>> capacity for model inference.
>>
>>  BMB: I think HGLMs are a very promising way of *estimating*
>> the parameters of GLMMs (I too wonder why they don't seem to
>> be discussed much), but I don't think they get us any farther forward
>> with inference.
>>
>> Thanks again, in advance, to anyone who took this on..... apologies
>> for any glaring errors or assignment of ideas to people incorrectly.
>>
>> Andrew
>>
>> ---------------------------------------------------------------------------------
>> Dr. Andrew Beckerman
>> Department of Animal and Plant Sciences, University of Sheffield,
>> Alfred Denny Building, Western Bank, Sheffield S10 2TN, UK
>> ph +44 (0)114 222 0026; fx +44 (0)114 222 0002
>> http://www.beckslab.staff.shef.ac.uk/
>>
>> http://www.flickr.com/photos/apbeckerman/
>> http://www.warblefly.co.uk
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models