[R-sig-ME] How many groups is enough?

Mon Aug 31 23:47:58 CEST 2009

Le lundi 31 août 2009 à 07:33 -0500, Douglas Bates a écrit :
> On Mon, Aug 31, 2009 at 3:25 AM, Ken Beath<ken at kjbeath.com.au> wrote:
> > On 31/08/2009, at 8:00 AM, Kingsford Jones wrote:
> >
> >> Here are some thoughts, which are just conjecture (caveat emptor).
> >> I'd be interested in hearing contrary facts or opinion.
> >>
> >> Because of the flexibility of mixed models I think it's hard to come
> >> up with rules of thumb here. To examine the various scenarios,
> >> simulations would need to look at effects of number of groups, number
> >> of observations within groups and the balance of those observations
> >> between groups, error variance, group variance, ratio of error and
> >> group variances, number of levels, types of random slopes, random
> >> effects covariance structure, error covariance structure, fixed
> >> effects structure, etc...  Clearly permutations of the above could
> >> lead to an awful lot of simulations, not to mention what happens when
> >> you move away from normal errors and work with GLMMs.
> >>
> >> My guess is that in general, for small numbers of groups (or just
> >> small between group variance?) the sampling distribution of the
> >> between group variance will have a long right tail and large spread.
> >> Because the REML estimates are unbiased this would imply that when you
> >> have few groups the majority (and perhaps a large majority) of the
> >> estimates will be low, while some will be very high.
> >>
> >
> > One point, is that for most analyses we are not interested in estimates of
> > the random effect variances. My impression is that other parameter estimates
> > are fairly robust to the random effects variance, so if the models fit
> > sensibly then it seems a reasonable approach. One problem with a small
> > number of groups may be the use of Empirical Bayes, as it ignores the
> > estimate uncertainty. I assume somebody has written a paper on this,
> > advocating full Bayesian analysis. People seem happy to do random effects
> > meta-analysis with only a few trials.
> 
> I agree that the precision of estimates of the variance components can
> be poor and that this is not that much of a problem when one is
> primarily interested in the estimates of the fixed-effects parameters.
>  (By the way, the statement that "REML estimates are unbiased" is not
> true in general.  Even in the simple, balanced cases where they are
> unbiased, I don't think it is an important property because the
> distribution of the estimator is so skewed that characterizing the
> distribution by its mean is unrealistic.).

This almost sounds as a plea for full Bayesian analysis (aiming at
assessing the posterior distribution of a parameter, rather than
gambling on a constant-but-forever-unknown value), possibly using
low-information conjugate priors. In his small but interesting textbook,
Jim Alberts illustrates the difficulty of assessing inter-groups
variance (or variance-like) parameters. By totally different wys, you
reach the same conclusions...

> I produced some plots of the profiled likelihood of the variance
> components for a simple, balanced example with 6 groups (a model for
> the Dyestuff data).  They are rather sobering although one should
> expect highly skewed patterns for a variance estimate (think of the
> simplest case of the estimate of a variance from the mythical i.i.d.
> Gaussian sample).  The plots are available in
> http://lme4.r-forge.r-project.org/slides/2009-07-21-Seewiesen/4PrecisionD.pdf

May I suggest that further releases of lmer (and possibly nlme) contain
at least a vignette pointing to this (and possibly your previous
presentations, that I found quite useful for explaining the intricacies
of mixed-model to non-statisticians) ? Of course, a full book should be
preferable (and I understand that this is one of your goals), but in the
interim, these slides should avoid you lot of similar questions on the
list.

Another suggestion : give in a vignette the current state of confidence
one should put in various parts of the lmer pckage ; for example, I
understand that you have had doubts about the mcmc algorithms : are
these doubts confirmed ? infirmed ? still lingering ? Can we use them
"safely" to get confidence intervals on fixed effects coefficients ? for
random effects variances ? or (heavens ...) feeding p-values to journal
editors that crave them (BTW : wht about "mcmcpavlues" in package
"languageR" ?) ? If not, what can be done to answer those basic
questions ?

Respectfully yours,

					Emmanuel Charpentier

> >
> > Ken
> >
> >> So the question remains: "what is a 'small' number of groups?".  I'm
> >> not sure but the following may be suggestive, at least of the symmetry
> >> of the sampling distribution (i.e. chi sq w/ df = # groups - 1):
> >>
> >> ngroups <- c(4, 6, 10, 15, 20)
> >> plot(0, type='n', xlim=c(0, 30), ylim=c(0, .3))
> >> for (i in ngroups) {
> >>  plot(function(x) dchisq(x, i - 1), 0, 60, add=TRUE)
> >> }
> >>
> >>
> >> Also, googling turned up the paper below, which for a sub-class of
> >> mixed models suggests that >=50 groups is sufficient to get
> >> group-level variances and standard errors that are unbiased (but not
> >> necessarily low-variance, AFAICS).
> >>
> >> @article{maas2005sufficient,
> >>  title={{Sufficient sample sizes for multilevel modeling}},
> >>  author={Maas, C.J.M. and Hox, J.J.},
> >>  journal={Methodology},
> >>  volume={1},
> >>  number={3},
> >>  pages={86--92},
> >>  year={2005}
> >>  abstract={An important problem in multilevel modeling is what
> >> constitutes a sufficient sample size for accurate estimation. In
> >> multilevel analysis, the major restriction is often the higher-level
> >> sample size. In this paper, a simulation study is used to determine
> >> the influence of different sample sizes at the group level on the
> >> accuracy of the estimates (regression coefficients and variances)
> >> and their standard errors. In addition, the influence of other
> >> factors, such as the lowest-level sample size and different variance
> >> distributions between the levels (different intraclass correlations),
> >> is examined. The results show that only a small sample size
> >> at level two (meaning a sample of 50 or less) leads to biased
> >> estimates of the second-level standard errors. In all of the other
> >> simulated conditions the estimates of the regression coefficients, the
> >> variance components, and the standard errors are unbiased
> >> and accurate.}
> >> }
> >>
> >>
> >> hth,
> >>
> >> Kingsford Jones
> >>
> >>
> >>
> >>
> >> On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics
> >> Ltd.<highstat at highstat.com> wrote:
> >>>
> >>>>
> >>>> Alain Zuur's response to a recent posting raises an interesting
> >>>> question.
> >>>> To
> >>>> use a random effects model what number
> >>>>
> >>>> of groups is actually sufficient?
> >>>>
> >>>>
> >>>> I have heard talk of a minimum of 20 groups but have seen numerous
> >>>> examples
> >>>> in books and published papers with
> >>>>
> >>>> much less than this. Is there a definitive reference on this?
> >>>>
> >>>>
> >>>
> >>> Graham,
> >>>
> >>> Actually..it turned out that the data set for which the question was
> >>> asked,
> >>> had about 350 subjects I believe.
> >>>
> >>> But anyway....that is not your question. In general you see the magic "5"
> >>> in
> >>> some textbooks.....but for what it is worth...I recently had to program a
> >>> ZIP for 2-way nested data in RBugs..and in order to do this, I started
> >>> with
> >>> 1-way and 2-way GLMMs (just to build up the code). And to check whether
> >>> my
> >>> code was "correct", I compared the results with that of 3-4 R packages
> >>> (e.g.
> >>> glmmPQL, lmer, glmml).  The data set consisted of multiple observations
> >>> per
> >>> animal, for 5-30 animals per colony, and 9 colonies. I noticed that the
> >>> estimated values for the variance for the random intercept colony
> >>> differed a
> >>> lot between these packages. But all came with similar estimates for the
> >>> animal-within-colony random intercept.
> >>>
> >>> Not that it tells you that much (all packages giving the same result
> >>> doesn't
> >>> mean it is correct)....but it is a bit worrying. Perhaps a simulation
> >>> study
> >>> gives you a better answer. The data I use(d) are highly unbalanced..so
> >>> that
> >>> may have played a role as well.
> >>>
> >>> Alain
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>> Dr. Alain F. Zuur
> >>> First author of:
> >>>
> >>> 1. Analysing Ecological Data (2007).
> >>> Zuur, AF, Ieno, EN and Smith, GM. Springer. 680 p.
> >>> URL: www.springer.com/0-387-45967-7
> >>>
> >>>
> >>> 2. Mixed effects models and extensions in ecology with R. (2009).
> >>> Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer.
> >>> http://www.springer.com/life+sci/ecology/book/978-0-387-87457-9
> >>>
> >>>
> >>> 3. A Beginner's Guide to R (2009).
> >>> Zuur, AF, Ieno, EN, Meesters, EHWG. Springer
> >>> http://www.springer.com/statistics/computational/book/978-0-387-93836-3
> >>>
> >>>
> >>> Other books: http://www.highstat.com/books.htm
> >>>
> >>>
> >>> Statistical consultancy, courses, data analysis and software
> >>> Highland Statistics Ltd.
> >>> 6 Laverock road
> >>> UK - AB41 6FN Newburgh
> >>> Tel: 0044 1358 788177
> >>> Email: highstat at highstat.com
> >>> URL: www.highstat.com
> >>> URL: www.brodgar.com
> >>>
> >>> _______________________________________________
> >>> R-sig-mixed-models at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>>
> >>
> >> _______________________________________________
> >> R-sig-mixed-models at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>
> >
> > _______________________________________________
> > R-sig-mixed-models at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >
>