[R-sig-ME] Teaching Mixed Effects

Raubertas, Richard richard_raubertas at merck.com
Fri Jan 30 19:22:44 CET 2009


The accumulation of quoted posts in this thread is quite long, so I 
have trimmed out all but some paragraphs from Doug Bates that prompt my 
comments.

> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org 
> [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf 
> Of Douglas Bates
> Sent: Thursday, January 22, 2009 5:40 PM
> To: Bolker,Benjamin Michael
> Cc: R Models Mixed
> Subject: Re: [R-sig-ME] Teaching Mixed Effects
> 
   [ ... ]
> 
> If we want to perform a hypothesis test related to a fixed-effects
> term in a mixed model (and, for the moment, I will not go into the
> question of whether statistical inferences should always be phrased as
> the result of hypothesis tests) I would claim we should start at the
> beginning, which is considering two models for the data at hand, one
> model being a special case of the other.  We need to decide how we
> measure the quality of the fit of the general model relative to the
> more specific model and how we measure the additional cost of the
> general model.  Then we need to formulate a test statistic.  If we are
> incredibly lucky, the null distribution of this test statistic will be
> well-defined (that is, it will not depend on the values of other,
> unknown parameters) and we can evaluate probabilities associated with
> it.  That does happen in the case of the Gaussian linear model.  I
> personally don't think it will be possible to provide a
> general approach that isolates the effect of a fixed-effect term in a
> linear mixed model using a statistic that does not depend on the
> values of other parameters.  I would be delighted if someone can do it
> but I think there is too much that goes right in the case of the
> Gaussian linear model to expect that the same incredible
> simplifications will apply to other models.
> 
> I don't feel that holy grail of inference in mixed effects models
> should be a magical formula for degrees of freedom to be associated
> with some ratio that looks like a t or an F statistic (despite the
> strongly held beliefs of those in the First Church of the
> Kenward-Roger Approximation). Certainly there has been a lot of
> statistical research related to approximating a difficult distribution
> by a more common distribution but I view this approach as belonging to
> an earlier era.  It is the approach embodied in software like SAS
> whose purpose often seems to be to evaluate difficult formulas and
> provide reams of output including every number that could possibly be
> of interest.  I think we should use the power of current and future
> computers to interact with and explore data and models for the data.
> MCMC is one way to do this.  In nonlinear regression Don and I
> advocated profiling the sum of squares function with respect to the
> values of individual parameters as another way of assessing the actual
> behavior of the model versus trying to formulate an approximation.
> I'm sure that creative people will come up with many other ways to use
> the power of computers to this end.  The point is to explore the
> actual behavior of the model/data combination, not to do just one fit,
> calculate a bunch of summary statistics, apply approximate
> distributions to get p-values and go home.

But a goal of this exploration of the model/data is still to come up 
with an approximation to a sampling distribution (of a test statistic 
or parameter estimate), right?  Ultimately, once all the exploration is 
done, the scientific researcher still wants to be able to tell his/her 
colleagues "Based on these data my conclusion about the effect of 
treatment X is ______ and my confidence in this conclusion is _____."  
That is, the researcher wants to make *inferences* from the data.

> 
> If we want to generalize methods of inference we should consider the
> whole chain of reasoning that leads us to the result rather than
> concentrating only on the last step, which is "now that I have the
> value of a statistic how do I convert it to a p-value?" or, even more
> specifically, "I have calculated something that looks like a t-ratio
> so I am going to assume that its distribution is indeed a
> t-distribution which leaves me with only one question and that is on
> how many degrees of freedom".
> 
> I appreciate that this is inconvenient to those applying such models
> to their data.  Philosophical discussions of the fundamentals of
> statistical inference are all well and good but when the referees on
> your paper say you have to provide a p-value for a particular term or
> test, it is a practical matter, not an academic, theoretical debate.
> Those with sufficient energy and skill plus a stellar reputation as a
> researcher may be able to convince editors that p-values are not the
> "be all and end all" of data analysis - Reinhold Kleigl has been able
> to do this in some cases - but that is definitely not the path of
> least resistance.  The sad reality is that p-values have taken on a
> role as the coin of the realm in science that is far beyond what any
> statistician would have imagined.  (Apparently the default
> "significance level" of 5%, which is considered in some disciplines to
> be carved in stone, resulted from a casual comment by Fisher to the
> effect that he might regard an outcome that would be expected to occur
> less than, say, 1 time in 20 as "significant".)

This paragraph brings to mind some comments about p-values and hypothesis
tests that seem popular on this list.  Among users of lme4 a theme seems
to be that "this ignorant editor/referee is insisting that I demonstrate
that my discovery of the effect of treatment X is not a false positive.
How can I get around this?"

Every field of science needs to protect its literature from being
overwhelmed by false "discoveries".  Since all the professional and
economic rewards go to those who make discoveries, there is enormous
incentive to claim that one has found an effect or association, and so
there needs to be a reality check.  The conventional way of judging these
claims is via p-values and hypothesis tests.  Granting all of their
faults and limitations, what do people propose as a better way?

Some will probably say we should focus on (interval) estimation of
parameters rather than testing.  But this doesn't solve the problem at
hand.  Remember how this discussion started:  the difficulty of
calculating reliable p-values for fixed effects.  Because of the duality
between hypothesis tests and confidence intervals, if you can't get
reliable p-values, then essentially by definition you can't get reliable
confidence intervals either.  So the issue of testing-versus-estimation
is a red herring with respect to the deeper problem of inference for
fixed effects in mixed models.
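
To make the duality concrete, here is a toy numerical illustration (the
numbers are invented, not from any real analysis):  for a Wald-type
statistic, a value theta0 lies outside the nominal 95% interval exactly
when the two-sided p-value for testing theta = theta0 falls below 0.05.

    est <- 1.8; se <- 0.9; theta0 <- 0              # invented values, for illustration
    ci   <- est + c(-1, 1) * qnorm(0.975) * se      # Wald 95% interval
    pval <- 2 * (1 - pnorm(abs((est - theta0) / se)))
    (theta0 < ci[1] || theta0 > ci[2]) == (pval < 0.05)   # TRUE: the test and the interval agree

An unreliable p-value and an unreliable interval are therefore two faces
of the same problem.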

Others may try to dodge the problem by claiming to be Bayesians,
calculating credible intervals from posterior distributions.  But this
won't hold water if they are using lme4 and mcmcsamp.  No honest Bayesian
can use an "uninformative" prior (if such a thing even exists), or at
least not more than once in any area of research:  after analysis of one
data set, the posterior from that analysis should inform the prior for
the next, but lme4 has its priors hard-coded.  I think the real rationale
for mcmcsamp is the hope that it will produce results with good
frequentist properties.  I am not aware that this has been demonstrated
for mixed models in the peer-reviewed statistics literature.
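
To be clear about what a demonstration would involve:  something like the
coverage simulation sketched below, where data are generated from a known
mixed model and one checks how often the interval covers the true fixed
effect.  (This is only a sketch of my own; the data-generating model, the
sample sizes, and the plain Wald interval being checked are arbitrary
choices for illustration, not anything produced by mcmcsamp.)

    library(lme4)
    set.seed(1)
    beta1 <- 2                # true fixed-effect slope in the simulation
    nsim  <- 200
    cover <- logical(nsim)
    for (i in seq_len(nsim)) {
      dat   <- data.frame(g = factor(rep(1:20, each = 5)), x = rnorm(100))
      dat$y <- 1 + beta1 * dat$x + rnorm(20)[dat$g] + rnorm(100)   # random intercepts + noise
      fit   <- lmer(y ~ x + (1 | g), data = dat)
      est   <- fixef(fit)["x"]
      se    <- coef(summary(fit))["x", "Std. Error"]
      cover[i] <- (est - 1.96 * se) < beta1 && beta1 < (est + 1.96 * se)
    }
    mean(cover)   # near 0.95 only if the interval has its nominal coverage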

> 
> It is unhelpful of me not to provide p-values in the lmer summaries
> but I develop the software out of interest in doing it as well as I
> possibly can and not because someone assigns me a task to compute
> something.  I really don't know of a good way of assigning p-values to
> t-ratios or F-ratios so I don't.  I still report the ratio of the
> estimate divided by its standard error, and even call it a t-ratio,
> because I think it is informative.
> 

Doug, can you elaborate on that last clause?  In what way is the
(absolute) ratio |T| informative in a way that the monotonic
transformation 2*(1 - pnorm(abs(T))) is not?  In other words, if a
p-value (based in this case on a standard normal) is not reliable for
inference, what inferential value does T have?  Less formally, if T = 2,
for example, what exactly do you conclude about the parameter, and what
is your confidence in that conclusion?
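
For concreteness, the transformation I have in mind is nothing more than
the following (using the sleepstudy data shipped with lme4 purely as a
stand-in for real data):

    library(lme4)
    fm   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
    tval <- coef(summary(fm))[, "t value"]   # estimate / standard error
    pval <- 2 * (1 - pnorm(abs(tval)))       # normal-based two-sided p-value
    cbind(t.ratio = tval, p.normal = pval)

Since the second column is a monotone function of the first, I don't see
how one can be informative and the other not.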
 
   [ ... ]
> 
> I feel that the likelihood ratio is a perfectly reasonable way of
> comparing two model fits where one is a special case of the other.  In
> fact, if the models have been fit by maximum likelihood, the
> likelihood ratio would, I think, be the first candidate for a test
> statistic.  The problem with likelihood ratio tests is not the
> likelihood ratio, per se -- it is converting the likelihood ratio to a
> p-value.  You need to be able to evaluate the distribution of the
> likelihood ratio under the null hypothesis.  The chi-square
> approximation to the distribution is exactly that - an approximation -
> and its validity depends on not testing at the boundary and on having
> a large sample, in some sense of the sample size.  If I were really
> interested in evaluating a p-value for the likelihood ratio I would
> probably try a parametric bootstrap to get a reference distribution.
> 

Yes, but the problem is that (as you noted earlier in your post) the null
distribution depends on the values of the nuisance parameters, except in
very special cases.  A simple parametric bootstrap conditions on the
estimated values of those parameters, as if they were known.
Nevertheless, a function to do this might be a useful addition to lme4.
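
For what it is worth, here is a rough sketch of the sort of function I
mean, a parametric bootstrap of the likelihood ratio for a fixed-effect
term (this assumes a version of lme4 that provides simulate() and
refit(), and the sleepstudy model is again only a stand-in for real
data):

    library(lme4)
    ## fit both models by ML, since they differ in the fixed effects
    full <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
    null <- lmer(Reaction ~ 1    + (Days | Subject), sleepstudy, REML = FALSE)
    obsLR <- as.numeric(2 * (logLik(full) - logLik(null)))

    nsim  <- 999
    simLR <- numeric(nsim)
    for (i in seq_len(nsim)) {
      y <- simulate(null)[[1]]    # data generated from the fitted null model
      simLR[i] <- as.numeric(2 * (logLik(refit(full, y)) - logLik(refit(null, y))))
    }
    (1 + sum(simLR >= obsLR)) / (nsim + 1)   # bootstrap p-value

Note that this is exactly the conditioning I referred to above:  the
simulated data treat the estimated variance components as if they were
the true values.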

Rich Raubertas
Merck & Co.