[R] question in using nlme and lme4 for unbalanced data

Mon Oct 25 22:15:22 CEST 2010

Bert Gunter <gunter.berton <at> gene.com> writes:

  Thanks for starting this.  I don't really disagree, much.

> I would say that these are not really R questions, but about
> statistical procedures in general. However, there are some important
> issues that you touch on, so I cannot resist commenting. Hopefully my
> comments might pique others to add theirs -- and perhaps to disagree
> with me.
> 
> 1. Unbalanced data?
> Yes that's what they're for. However, you need to  recognize that with
> unbalanced data, interpretability of coefficients may be lost (the
> estimates depend on what factors you include in your model).

  Agreed.

> 2.Proper analysis?
> TRUST in DOUG. (Doug Bates, the author of the packages and an
> acknowledged expert in the field).

  Yes, you seem to have encoded your design (randomized complete
block, with a twist) correctly.  You can in principle quantify
the variation in mean response among blocks [(1|block)], but not
the variation in treatments among blocks [(ttt|block) , where ttt
is a factor].  It is not surprising, but a good sign, that you
get identical coefficient estimates from lme and lmer -- this
might not be the case if you had an extremely exotic experimental
design (in which case you might uncover a bug), or if your estimates
are numerically unstable (in which case the different algorithms
underlying the two different functions might give different
answers).

> 3. Without P values, how do I I tell what's significant?
> Well, first of all, P values/statistical significance are only
> meaningful when hypotheses are pre-specified; and even then,
> statistical significance is a function of effect magnitude, sample
> size, and experimental noise, so usually doesn't accord with
> scientific importance anyway unless the experiments have been powered
> appropriately, which is rarely the case outside clinical trials. So
> statistical significance is probably irrelevant to the scientific
> questions iin any case.
> 
> I would say that you need to focus on effect (coefficient) sizes and
> perhaps do sensitivity analyses to see how much predicted results
> change as you change them. Good graphs of your data are also always a
> good idea.

  This is true, in general, but it's also true that you can't always
fight against the machine.  If you can decide on the basis of your
design and examination of a standard statistical reference
(Gotelli and Ellison and Quinn and Keough are good for ecologists)
what the appropriate 'denominator degrees of freedom' would be for a nearly
equivalent situation (it would be good to check the DF that lme tells you)
then you can do the t or F tests yourself -- if uncertain then pick 
conservative values (i.e. the smaller of the plausible denominator DFs).
Or trust what lme (representing a younger and ?less wise? Doug Bates)
tells you.

 Be warned that the values you get from summary() represent marginal
tests, which very likely don't make sense when there is an interaction
present in the model.  Read ?anova.lme , and make sure you know the
difference between sequential and marginal tests.

> 4, With only 5 blocks, you have practically no information to estimate
> variance anyway. You're probably better off treating them as fixed
> effects and just estimating the model via lm. You might find a sqrt or
> log transformation of the enfa numbers to be useful, though...

  Note also that the estimates you are getting from both lmer and lm
say that there is zero among-block variance in the intercept. As Bert
says, that really just indicates that you don't have very much
information about among-block variation (the data are consistent
with all the error being among individuals). If there is someone in
a position of power over you who will review your work, the pragmatic
thing to do is to ask them what they recommend, because anything you
do at this point (ignore the block variation entirely, treat it as
random, or treat it as fixed) will be unacceptable to some members
of the statistical or ecological community.  It won't make very
much difference to your answers either way (might change a p-value
from <0.05 to >0.05 or vice versa, but in reality that does not
mean the answers are much different).

> 
> Cheers,
> Bert
>