[R-sig-ME] How many groups is enough?

Mon Aug 31 00:00:34 CEST 2009

Here are some thoughts, which are just conjecture (caveat emptor).
I'd be interested in hearing contrary facts or opinion.

Because mixed models are so flexible, I think it's hard to come up
with rules of thumb here. To cover the various scenarios, simulations
would need to vary the number of groups, the number of observations
within groups and the balance of those observations between groups,
the error variance, the group variance, the ratio of the error and
group variances, the number of levels, the types of random slopes, the
random-effects covariance structure, the error covariance structure,
the fixed-effects structure, etc. Clearly, combinations of the above
could lead to an awful lot of simulations, not to mention what happens
when you move away from normal errors and work with GLMMs.
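
A rough sketch of how quickly such a design grows. The factor levels
below are made up purely for illustration (none come from a real
study), and only five of the factors listed above are crossed:

```r
# Hypothetical factor levels, chosen only to illustrate the design size
design <- expand.grid(
  n_groups     = c(4, 6, 10, 20, 50),
  n_per_group  = c(2, 5, 20),
  balance      = c("balanced", "unbalanced"),
  var_ratio    = c(0.1, 1, 10),      # group variance / error variance
  random_slope = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)
n_scenarios <- nrow(design)          # 5 * 3 * 2 * 3 * 2 scenarios
print(n_scenarios)
```

Each row would still need many simulated data sets and model fits, so
the total cost is this count times the number of replicates.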

My guess is that in general, for small numbers of groups (or just
small between-group variance?), the sampling distribution of the
between-group variance estimate will have a long right tail and a
large spread. Because the REML estimates are (approximately) unbiased,
the mean of that skewed distribution sits near the true value, so its
median must lie below it. This would imply that when you have few
groups the majority (and perhaps a large majority) of the estimates
will be too low, while some will be very high.
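
One way to check this conjecture in the simplest case (balanced
one-way random intercept, normal errors) is a small simulation. This
is only a sketch: it uses the ANOVA moment estimator truncated at
zero, which as far as I know matches the REML estimate for balanced
one-way data, and the function name sim_var_hat, the variances (both
1), the 10 observations per group and the 2000 replicates are all
arbitrary choices of mine.

```r
set.seed(1)

# One simulated estimate of the between-group variance for k groups
sim_var_hat <- function(k, n = 10, sd_group = 1, sd_error = 1) {
  g <- rep(seq_len(k), each = n)                  # group labels, n obs each
  y <- rnorm(k, sd = sd_group)[g] + rnorm(k * n, sd = sd_error)
  group_means <- tapply(y, g, mean)
  msb <- n * var(group_means)                     # between-group mean square
  msw <- sum((y - group_means[g])^2) / (k * (n - 1))  # within-group mean square
  max(0, (msb - msw) / n)                         # moment estimate, truncated at 0
}

ngroups <- c(4, 6, 10, 15, 20)
est <- sapply(ngroups, function(k) replicate(2000, sim_var_hat(k)))
colnames(est) <- ngroups

# Share of estimates below the true group variance (1) and the upper tail
prop_low <- colMeans(est < 1)
q95 <- apply(est, 2, quantile, probs = 0.95)
print(round(rbind(prop_low, q95), 2))
```

If the conjecture holds, prop_low should sit above one half and q95
should shrink toward 1 as the number of groups increases.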

So the question remains: "what is a 'small' number of groups?".  I'm
not sure, but the following may be suggestive, at least of the symmetry
of the sampling distribution (the curves are chi-square densities with
df = number of groups - 1):

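ngroups <- c(4, 6, 10, 15, 20)
# Empty canvas; one chi-square density per number of groups is added below
plot(0, type = "n", xlim = c(0, 30), ylim = c(0, 0.3),
     xlab = "x", ylab = "density")
for (i in seq_along(ngroups)) {
  curve(dchisq(x, df = ngroups[i] - 1), from = 0, to = 30,
        add = TRUE, col = i)
}
legend("topright", legend = paste(ngroups, "groups"),
       col = seq_along(ngroups), lty = 1)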
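
To put rough numbers on the asymmetry of those chi-square densities,
here is a small sketch using two standard facts: a chi-square with df
degrees of freedom has mean df and skewness sqrt(8 / df). The column
names are my own.

```r
ngroups <- c(4, 6, 10, 15, 20)
df <- ngroups - 1

skewness   <- sqrt(8 / df)     # skewness of a chi-square with df degrees of freedom
below_mean <- pchisq(df, df)   # P(X < E[X]), since E[X] = df

print(round(data.frame(ngroups, df, skewness, below_mean), 3))
```

Both columns shrink as groups are added (below_mean toward 0.5), but
slowly, so some asymmetry remains even at 20 groups.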

Also, googling turned up the paper below, which for a sub-class of
mixed models suggests that more than 50 groups are sufficient to get
group-level variances and standard errors that are unbiased (but not
necessarily low-variance, AFAICS).

@article{maas2005sufficient,
  title={{Sufficient sample sizes for multilevel modeling}},
  author={Maas, C.J.M. and Hox, J.J.},
  journal={Methodology},
  volume={1},
  number={3},
  pages={86--92},
  year={2005},
  abstract={An important problem in multilevel modeling is what
constitutes a sufficient sample size for accurate estimation. In
multilevel analysis, the major restriction is often the higher-level
sample size. In this paper, a simulation study is used to determine
the influence of different sample sizes at the group level on the
accuracy of the estimates (regression coefficients and variances)
and their standard errors. In addition, the influence of other
factors, such as the lowest-level sample size and different variance
distributions between the levels (different intraclass correlations),
is examined. The results show that only a small sample size
at level two (meaning a sample of 50 or less) leads to biased
estimates of the second-level standard errors. In all of the other
simulated conditions the estimates of the regression coefficients, the
variance components, and the standard errors are unbiased
and accurate.}
}

hth,

Kingsford Jones

On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics Ltd.
<highstat at highstat.com> wrote:
>
>>
>> Alain Zuur's response to a recent posting raises an interesting
>> question. To use a random effects model what number of groups is
>> actually sufficient?
>>
>> I have heard talk of a minimum of 20 groups but have seen numerous
>> examples in books and published papers with much less than this. Is
>> there a definitive reference on this?
>>
>
> Graham,
>
> Actually, it turned out that the data set for which the question was
> asked had about 350 subjects, I believe.
>
> But anyway, that is not your question. In general you see the magic "5"
> in some textbooks, but for what it is worth: I recently had to program
> a ZIP (zero-inflated Poisson) model for 2-way nested data in RBugs, and
> in order to do this I started with 1-way and 2-way GLMMs (just to build
> up the code). To check whether my code was "correct", I compared the
> results with those of 3-4 R packages (e.g. glmmPQL, lmer, glmmML). The
> data set consisted of multiple observations per animal, for 5-30
> animals per colony, and 9 colonies. I noticed that the estimated
> variance of the colony random intercept differed a lot between these
> packages, but all gave similar estimates for the
> animal-within-colony random intercept.
>
> Not that it tells you that much (all packages giving the same result
> doesn't mean it is correct), but it is a bit worrying. Perhaps a
> simulation study would give you a better answer. The data I use(d) are
> highly unbalanced, so that may have played a role as well.
>
> Alain