[R-sig-ME] How many groups is enough?

Mon Aug 31 10:25:27 CEST 2009

On 31/08/2009, at 8:00 AM, Kingsford Jones wrote:

> Here are some thoughts, which are just conjecture (caveat emptor).
> I'd be interested in hearing contrary facts or opinion.
>
> Because of the flexibility of mixed models I think it's hard to come
> up with rules of thumb here. To examine the various scenarios,
> simulations would need to look at effects of number of groups, number
> of observations within groups and the balance of those observations
> between groups, error variance, group variance, ratio of error and
> group variances, number of levels, types of random slopes, random
> effects covariance structure, error covariance structure, fixed
> effects structure, etc...  Clearly permutations of the above could
> lead to an awful lot of simulations, not to mention what happens when
> you move away from normal errors and work with GLMMs.
>
> My guess is that in general, for small numbers of groups (or just
> small between group variance?) the sampling distribution of the
> between group variance will have a long right tail and large spread.
> Because the REML estimates are unbiased this would imply that when you
> have few groups the majority (and perhaps a large majority) of the
> estimates will be low, while some will be very high.
>

One point, is that for most analyses we are not interested in  
estimates of the random effect variances. My impression is that other  
parameter estimates are fairly robust to the random effects variance,  
so if the models fit sensibly then it seems a reasonable approach. One  
problem with a small number of groups may be the use of Empirical  
Bayes, as it ignores the estimate uncertainty. I assume somebody has  
written a paper on this, advocating full Bayesian analysis. People  
seem happy to do random effects meta-analysis with only a few trials.

Ken

> So the question remains: "what is a 'small' number of groups?".  I'm
> not sure but the following may be suggestive, at least of the symmetry
> of the sampling distribution (i.e. chi sq w/ df = # groups - 1):
>
> ngroups <- c(4, 6, 10, 15, 20)
> plot(0, type='n', xlim=c(0, 30), ylim=c(0, .3))
> for (i in ngroups) {
>  plot(function(x) dchisq(x, i - 1), 0, 60, add=TRUE)
> }
>
>
> Also, googling turned up the paper below, which for a sub-class of
> mixed models suggests that >=50 groups is sufficient to get
> group-level variances and standard errors that are unbiased (but not
> necessarily low-variance, AFAICS).
>
> @article{maas2005sufficient,
>  title={{Sufficient sample sizes for multilevel modeling}},
>  author={Maas, C.J.M. and Hox, J.J.},
>  journal={Methodology},
>  volume={1},
>  number={3},
>  pages={86--92},
>  year={2005}
>  abstract={An important problem in multilevel modeling is what
> constitutes a sufficient sample size for accurate estimation. In
> multilevel analysis, the major restriction is often the higher-level
> sample size. In this paper, a simulation study is used to determine
> the influence of different sample sizes at the group level on the
> accuracy of the estimates (regression coefficients and variances)
> and their standard errors. In addition, the influence of other
> factors, such as the lowest-level sample size and different variance
> distributions between the levels (different intraclass correlations),
> is examined. The results show that only a small sample size
> at level two (meaning a sample of 50 or less) leads to biased
> estimates of the second-level standard errors. In all of the other
> simulated conditions the estimates of the regression coefficients, the
> variance components, and the standard errors are unbiased
> and accurate.}
> }
>
>
> hth,
>
> Kingsford Jones
>
>
>
>
> On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics
> Ltd.<highstat at highstat.com> wrote:
>>
>>>
>>> Alain Zuur's response to a recent posting raises an interesting  
>>> question.
>>> To
>>> use a random effects model what number
>>>
>>> of groups is actually sufficient?
>>>
>>>
>>> I have heard talk of a minimum of 20 groups but have seen numerous
>>> examples
>>> in books and published papers with
>>>
>>> much less than this. Is there a definitive reference on this?
>>>
>>>
>>
>> Graham,
>>
>> Actually..it turned out that the data set for which the question  
>> was asked,
>> had about 350 subjects I believe.
>>
>> But anyway....that is not your question. In general you see the  
>> magic "5" in
>> some textbooks.....but for what it is worth...I recently had to  
>> program a
>> ZIP for 2-way nested data in RBugs..and in order to do this, I  
>> started with
>> 1-way and 2-way GLMMs (just to build up the code). And to check  
>> whether my
>> code was "correct", I compared the results with that of 3-4 R  
>> packages (e.g.
>> glmmPQL, lmer, glmml).  The data set consisted of multiple  
>> observations per
>> animal, for 5-30 animals per colony, and 9 colonies. I noticed that  
>> the
>> estimated values for the variance for the random intercept colony  
>> differed a
>> lot between these packages. But all came with similar estimates for  
>> the
>> animal-within-colony random intercept.
>>
>> Not that it tells you that much (all packages giving the same  
>> result doesn't
>> mean it is correct)....but it is a bit worrying. Perhaps a  
>> simulation study
>> gives you a better answer. The data I use(d) are highly  
>> unbalanced..so that
>> may have played a role as well.
>>
>> Alain
>>
>>
>>
>>
>>
>> --
>>
>>
>> Dr. Alain F. Zuur
>> First author of:
>>
>> 1. Analysing Ecological Data (2007).
>> Zuur, AF, Ieno, EN and Smith, GM. Springer. 680 p.
>> URL: www.springer.com/0-387-45967-7
>>
>>
>> 2. Mixed effects models and extensions in ecology with R. (2009).
>> Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer.
>> http://www.springer.com/life+sci/ecology/book/978-0-387-87457-9
>>
>>
>> 3. A Beginner's Guide to R (2009).
>> Zuur, AF, Ieno, EN, Meesters, EHWG. Springer
>> http://www.springer.com/statistics/computational/book/978-0-387-93836-3
>>
>>
>> Other books: http://www.highstat.com/books.htm
>>
>>
>> Statistical consultancy, courses, data analysis and software
>> Highland Statistics Ltd.
>> 6 Laverock road
>> UK - AB41 6FN Newburgh
>> Tel: 0044 1358 788177
>> Email: highstat at highstat.com
>> URL: www.highstat.com
>> URL: www.brodgar.com
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>