[R-sig-ME] How many groups is enough?

Douglas Bates bates at stat.wisc.edu
Mon Aug 31 14:33:05 CEST 2009

On Mon, Aug 31, 2009 at 3:25 AM, Ken Beath<ken at kjbeath.com.au> wrote:
> On 31/08/2009, at 8:00 AM, Kingsford Jones wrote:
>> Here are some thoughts, which are just conjecture (caveat emptor).
>> I'd be interested in hearing contrary facts or opinion.
>> Because of the flexibility of mixed models I think it's hard to come
>> up with rules of thumb here. To examine the various scenarios,
>> simulations would need to look at effects of number of groups, number
>> of observations within groups and the balance of those observations
>> between groups, error variance, group variance, ratio of error and
>> group variances, number of levels, types of random slopes, random
>> effects covariance structure, error covariance structure, fixed
>> effects structure, etc...  Clearly permutations of the above could
>> lead to an awful lot of simulations, not to mention what happens when
>> you move away from normal errors and work with GLMMs.
>> My guess is that in general, for small numbers of groups (or just
>> small between group variance?) the sampling distribution of the
>> between group variance will have a long right tail and large spread.
>> Because the REML estimates are unbiased this would imply that when you
>> have few groups the majority (and perhaps a large majority) of the
>> estimates will be low, while some will be very high.
> One point, is that for most analyses we are not interested in estimates of
> the random effect variances. My impression is that other parameter estimates
> are fairly robust to the random effects variance, so if the models fit
> sensibly then it seems a reasonable approach. One problem with a small
> number of groups may be the use of Empirical Bayes, as it ignores the
> estimate uncertainty. I assume somebody has written a paper on this,
> advocating full Bayesian analysis. People seem happy to do random effects
> meta-analysis with only a few trials.

I agree that the precision of estimates of the variance components can
be poor and that this is not that much of a problem when one is
primarily interested in the estimates of the fixed-effects parameters.
 (By the way, the statement that "REML estimates are unbiased" is not
true in general.  Even in the simple, balanced cases where they are
unbiased, I don't think it is an important property because the
distribution of the estimator is so skewed that characterizing the
distribution by its mean is unrealistic.).

I produced some plots of the profiled likelihood of the variance
components for a simple, balanced example with 6 groups (a model for
the Dyestuff data).  They are rather sobering although one should
expect highly skewed patterns for a variance estimate (think of the
simplest case of the estimate of a variance from the mythical i.i.d.
Gaussian sample).  The plots are available in

> Ken
>> So the question remains: "what is a 'small' number of groups?".  I'm
>> not sure but the following may be suggestive, at least of the symmetry
>> of the sampling distribution (i.e. chi sq w/ df = # groups - 1):
>> ngroups <- c(4, 6, 10, 15, 20)
>> plot(0, type='n', xlim=c(0, 30), ylim=c(0, .3))
>> for (i in ngroups) {
>>  plot(function(x) dchisq(x, i - 1), 0, 60, add=TRUE)
>> }
>> Also, googling turned up the paper below, which for a sub-class of
>> mixed models suggests that >=50 groups is sufficient to get
>> group-level variances and standard errors that are unbiased (but not
>> necessarily low-variance, AFAICS).
>> @article{maas2005sufficient,
>>  title={{Sufficient sample sizes for multilevel modeling}},
>>  author={Maas, C.J.M. and Hox, J.J.},
>>  journal={Methodology},
>>  volume={1},
>>  number={3},
>>  pages={86--92},
>>  year={2005}
>>  abstract={An important problem in multilevel modeling is what
>> constitutes a sufficient sample size for accurate estimation. In
>> multilevel analysis, the major restriction is often the higher-level
>> sample size. In this paper, a simulation study is used to determine
>> the influence of different sample sizes at the group level on the
>> accuracy of the estimates (regression coefficients and variances)
>> and their standard errors. In addition, the influence of other
>> factors, such as the lowest-level sample size and different variance
>> distributions between the levels (different intraclass correlations),
>> is examined. The results show that only a small sample size
>> at level two (meaning a sample of 50 or less) leads to biased
>> estimates of the second-level standard errors. In all of the other
>> simulated conditions the estimates of the regression coefficients, the
>> variance components, and the standard errors are unbiased
>> and accurate.}
>> }
>> hth,
>> Kingsford Jones
>> On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics
>> Ltd.<highstat at highstat.com> wrote:
>>>> Alain Zuur's response to a recent posting raises an interesting
>>>> question.
>>>> To
>>>> use a random effects model what number
>>>> of groups is actually sufficient?
>>>> I have heard talk of a minimum of 20 groups but have seen numerous
>>>> examples
>>>> in books and published papers with
>>>> much less than this. Is there a definitive reference on this?
>>> Graham,
>>> Actually..it turned out that the data set for which the question was
>>> asked,
>>> had about 350 subjects I believe.
>>> But anyway....that is not your question. In general you see the magic "5"
>>> in
>>> some textbooks.....but for what it is worth...I recently had to program a
>>> ZIP for 2-way nested data in RBugs..and in order to do this, I started
>>> with
>>> 1-way and 2-way GLMMs (just to build up the code). And to check whether
>>> my
>>> code was "correct", I compared the results with that of 3-4 R packages
>>> (e.g.
>>> glmmPQL, lmer, glmml).  The data set consisted of multiple observations
>>> per
>>> animal, for 5-30 animals per colony, and 9 colonies. I noticed that the
>>> estimated values for the variance for the random intercept colony
>>> differed a
>>> lot between these packages. But all came with similar estimates for the
>>> animal-within-colony random intercept.
>>> Not that it tells you that much (all packages giving the same result
>>> doesn't
>>> mean it is correct)....but it is a bit worrying. Perhaps a simulation
>>> study
>>> gives you a better answer. The data I use(d) are highly unbalanced..so
>>> that
>>> may have played a role as well.
>>> Alain
>>> --
>>> Dr. Alain F. Zuur
>>> First author of:
>>> 1. Analysing Ecological Data (2007).
>>> Zuur, AF, Ieno, EN and Smith, GM. Springer. 680 p.
>>> URL: www.springer.com/0-387-45967-7
>>> 2. Mixed effects models and extensions in ecology with R. (2009).
>>> Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer.
>>> http://www.springer.com/life+sci/ecology/book/978-0-387-87457-9
>>> 3. A Beginner's Guide to R (2009).
>>> Zuur, AF, Ieno, EN, Meesters, EHWG. Springer
>>> http://www.springer.com/statistics/computational/book/978-0-387-93836-3
>>> Other books: http://www.highstat.com/books.htm
>>> Statistical consultancy, courses, data analysis and software
>>> Highland Statistics Ltd.
>>> 6 Laverock road
>>> UK - AB41 6FN Newburgh
>>> Tel: 0044 1358 788177
>>> Email: highstat at highstat.com
>>> URL: www.highstat.com
>>> URL: www.brodgar.com
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

More information about the R-sig-mixed-models mailing list