[R-sig-ME] Nested error term and unbalanced design

Ben Bolker bbolker at gmail.com
Mon Feb 25 23:48:36 CET 2013

On 13-02-25 12:55 PM, Baldwin, Jim -FS wrote:
> I think someone wise said "When you find yourself in a hole, first
> put down the shovel."  Someday I'll learn that.  (Maybe today.)  What
> follows is likely from my lack of biological (and maybe statistical)
> knowledge.
> The setup seems to be that individual birds (classified as to their
> species and habitat) are checked for the presence of ticks.  For each
> species and habitat combination there is a proportion of birds with
> ticks.  Each species is also classified as to genus and family.  It
> is of interest to see if there are differences among genus and family
> classifications.  I see everything as a fixed effect in this case.
> I see no random effects or a relevant variance component as I can't
> imagine that for any genus and family that there is actually a random
> sample from all species within that family (especially if there are
> only a small number of species within a particular family to select
> from).

  I use a different definition of random effects, one closer to the
pragmatic/Bayesian view than to the philosophical/frequentist one (this
is discussed at more length at http://glmm.wikidot.com/faq ).  In
essence, I base the distinction between fixed and random effects on two
criteria:

 * is it useful to estimate these parameters with shrinkage? (yes=random)

 * would I rather have the ability to extrapolate to unmeasured
units/make inferences about the variation among units (random) or to
make inferential statements about differences between particular sets of
units (fixed)?

 I do *not* make much use of the experimental-design criterion (were
these units selected randomly, or could they have been selected
randomly, from a larger set of values?).

  So I see no problem in treating family/genus/species as random
effects.  Opinions differ, though.
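
As a rough illustration of the shrinkage criterion, here is a sketch on
simulated data (not the poster's data set).  All names (n_species,
n_birds, ticks, no_ticks) are invented for the example, and it assumes a
current lme4 in which binomial models are fitted with glmer().  It
compares each species' own empirical logit with the estimate obtained
when species is treated as a random effect.

```r
library(lme4)

set.seed(1)
n_species  <- 12
n_birds    <- sample(3:40, n_species, replace = TRUE)   # unbalanced sample sizes
true_logit <- rnorm(n_species, mean = -1, sd = 0.8)     # species-level log-odds
ticks      <- rbinom(n_species, size = n_birds, prob = plogis(true_logit))

d <- data.frame(species  = factor(seq_len(n_species)),
                n_birds  = n_birds,
                ticks    = ticks,
                no_ticks = n_birds - ticks)

# Species as a random effect: one binomial row per species.
fit <- glmer(cbind(ticks, no_ticks) ~ 1 + (1 | species),
             family = binomial(link = "logit"), data = d)

# "Fixed-effect style" estimate: each species' own empirical logit
# (0.5 added to both counts so that 0 or n_birds ticks stays finite).
raw    <- log((d$ticks + 0.5) / (d$no_ticks + 0.5))

# Random-effect estimate: overall intercept plus the conditional mode.
shrunk <- coef(fit)$species[, "(Intercept)"]

print(round(cbind(n_birds = d$n_birds, raw = raw, shrunk = shrunk), 2))
```

In output of this kind one would expect the species with the fewest
birds to be pulled most strongly toward the overall mean, which is the
practical sense of "estimating with shrinkage".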

> If a family (either within a habitat type or across habitat types) is
> to be compared to another family, it would seem that the first
> comparison would be among the mean of the species proportions (or
> maybe the mean of the logits or probits) for each family.
> Next it is conceivable that one might want to know if the variability
> of the species within a family varies among families.  That could be
> done by defining/declaring the summary statistic of interest to be
> the variance of the "true" proportions within a family and one would
> use the sample data to estimate those variances.  But these variances
> would serve as summary statistics rather than as variance components
> essential to the definition of the model.  The underlying model would
> simply be the number of birds with ticks following a binomial
> distribution with the proportion of birds with ticks being a function
> of species and habitat.

  This is a sensible question, but hard to set up within lme4.  The
random effects coded in lme4 (and in most GLMMs) quantify whether the
mean (on the link scale = logit/probit/etc.) differs among units, not
whether the variation differs.  You could do this in AD Model
Builder/WinBUGS/Stan/etc.  (I think this has been discussed before on
the list.)

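
The purely descriptive version of this idea from the quoted text (the
variance of species-level logits within each family, used as a summary
statistic) can be sketched in base R.  This is only a sketch on
simulated data: the family labels, sample sizes and the 0.5 continuity
correction are assumptions made for the example.  Note that these raw
variances include binomial sampling noise, so they are not estimates of
a model variance component.

```r
set.seed(42)

# 12 hypothetical species in 3 hypothetical families, unequal sample sizes.
d <- data.frame(
  family  = rep(c("A", "B", "C"), times = c(4, 5, 3)),
  species = paste0("sp", 1:12),
  n_birds = sample(10:50, 12, replace = TRUE)
)

# Simulate families whose species differ in how variable they are.
sd_by_family <- c(A = 0.2, B = 1.0, C = 0.5)
true_logit   <- rnorm(12, mean = -1, sd = sd_by_family[d$family])
d$ticks      <- rbinom(12, size = d$n_birds, prob = plogis(true_logit))

# Empirical logit per species (0.5 added to both counts to avoid +/-Inf).
d$emp_logit <- log((d$ticks + 0.5) / (d$n_birds - d$ticks + 0.5))

# Per-family summaries: mean and variance of the species-level logits.
fam_mean <- tapply(d$emp_logit, d$family, mean)
fam_var  <- tapply(d$emp_logit, d$family, var)

print(round(cbind(mean_logit = fam_mean, var_logit = fam_var), 3))
```

Letting the among-species variance itself differ by family inside the
model is the part that would need one of the other tools named above.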
> I agree with the article you mentioned concerning the use of random
> coefficient models.  I just don't see treating species as a randomly
> selected subject from a family of species.  (Maybe treating insect
> species as a randomly selected species within a family where there
> are zillions of species but not for critters much higher up the food
> chain.)
> Jim
> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org
> [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Ben Bolker
> Sent: Monday, February 25, 2013 7:27 AM
> To: r-sig-mixed-models at r-project.org
> Subject: Re: [R-sig-ME] Nested error term and unbalanced design
> Baldwin, Jim -FS <jbaldwin at ...> writes:
>> While there is a definite order to family, genus, and species (no
>> pun intended), I think that the "nestedness" (if any) would be
>> related to how you selected your sampling units rather than the
>> fixed effects of family, genus, and species.  (I admit bias in
>> rarely if ever considering species as a random effect.)
>> Jim
> I think I respectfully disagree ... see below ...
>> I am trying to run a model that incorporates both environmental 
>> variables and taxonomic relationships, and I am unsure if I am 1) 
>> specifying the error term correctly, and 2) accounting for
>> unbalanced data correctly. I would appreciate any guidance you can
>> provide.
>> As a simplified example, I want to ask if a bird is more likely to
>> be carrying ticks based on the habitat it was caught in, and what
>> kind of bird it is (my actual model has many more environmental
>> variables). We have many related species in multiple genera in
>> multiple families, but all in the same order. Species is nested
>> within genus, and genus is nested within family. I want to estimate
>> a fixed effect for both habitat and species, while accounting for
>> the nestedness of the relationships of the birds, and I also want
>> to account for the fact that we caught more of certain species than
>> others.
>> My simplified model looks like this:
>> family=binomial(link="logit"))
>> where y is a column vector of (tick presence, tick absence)
>> So my questions are: is this the correct "grammar" for the nested
>> error? and does the nested error structure by itself take into
>> account the unbalanced data structure?
> Generally you don't have to worry about lack of balance in 'modern'
> mixed models unless it's really extreme.

> I'm having a little bit of a hard time conceptually with the idea of
> having species as a fixed effect _and_ having the variances of family
> and genus be random.  You certainly shouldn't have a categorical
> predictor (SPECIES) appear as both a random and a fixed effect,
> though.

> M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS), 
> family=binomial(link="logit"))
> *might* work (I would give it a try and see if the results are
> sensible). I would also consider
> family=binomial(link="logit"))
> if your data set is big enough to support it.  This allows for
> habitat to have different effects on different species ... (see a
> paper by Schielzeth and Forstmeier on the importance of including
> interactions between fixed and random effects: 
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2657178/ )
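
For readers on a current lme4, where non-Gaussian families are fitted
with glmer() rather than lmer(), the M1 form quoted above might look
like the sketch below.  Everything about the data is invented: the data
frame d, the count columns ticks and no_ticks (standing in for the
two-column response y), and all level labels.  The sketch only shows the
syntax and how the nested term expands; it says nothing about whether M1
suits the real data.

```r
library(lme4)

set.seed(2013)
# 3 families x 2 genera x 2 species, each observed in 2 habitats.
d <- expand.grid(HABITAT = c("field", "forest"),
                 sp = 1:2, gen = 1:2, fam = 1:3)
d$FAMILY  <- factor(paste0("F", d$fam))
d$GENUS   <- factor(paste0("G", d$gen))   # genus labels reused across families
d$SPECIES <- factor(paste0("F", d$fam, "G", d$gen, "S", d$sp))

n_birds <- sample(20:40, nrow(d), replace = TRUE)   # unequal sampling effort
fam_eff <- rnorm(3, sd = 0.5)                       # simulated family effects
gen_eff <- rnorm(6, sd = 0.5)                       # simulated genus-in-family effects
gen_id  <- (d$fam - 1) * 2 + d$gen                  # index 1..6 of genus within family
eta     <- -0.5 + 0.5 * (d$HABITAT == "forest") + fam_eff[d$fam] + gen_eff[gen_id]

d$ticks    <- rbinom(nrow(d), size = n_birds, prob = plogis(eta))
d$no_ticks <- n_birds - d$ticks

# (1 | FAMILY/GENUS) is shorthand for (1 | FAMILY) + (1 | GENUS:FAMILY),
# so reused genus labels are still treated as distinct within each family.
m1 <- glmer(cbind(ticks, no_ticks) ~ HABITAT + SPECIES + (1 | FAMILY/GENUS),
            family = binomial(link = "logit"), data = d)

print(names(ranef(m1)))   # grouping factors created by the nested term
print(VarCorr(m1))        # estimated variance components
```

Because each species belongs to exactly one genus and family, the fixed
SPECIES effects already absorb the genus and family differences, so the
variance estimates here may come out at or near zero (with a singular-fit
message).  That is consistent with the caution above about having
species fixed while family and genus are random.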
