[R-sig-ME] Nested error term and unbalanced design

Mon Feb 25 18:55:34 CET 2013

I think someone wise said "When you find yourself in a hole, first put down the shovel."  Someday I'll learn that.  (Maybe today.)  What follows may well reflect my lack of biological (and maybe statistical) knowledge.

The setup seems to be that individual birds (classified as to their species and habitat) are checked for the presence of ticks.  For each species and habitat combination there is a proportion of birds with ticks.  Each species is also classified as to genus and family.  It is of interest to see if there are differences among genus and family classifications.  I see everything as a fixed effect in this case.

I see no random effects or relevant variance components, as I can't imagine that for any genus or family there is actually a random sample from all species within that family (especially if there are only a small number of species within a particular family to select from).

If a family (either within a habitat type or across habitat types) is to be compared to another family, it would seem that the first comparison would be among the means of the species proportions (or maybe the means of the logits or probits) for each family.

Next, it is conceivable that one might want to know whether the variability among species within a family differs among families.  That could be done by declaring the summary statistic of interest to be the variance of the "true" proportions within a family, and one would use the sample data to estimate those variances.  But these variances would serve as summary statistics rather than as variance components essential to the definition of the model.  The underlying model would simply be the number of birds with ticks following a binomial distribution, with the proportion of birds with ticks being a function of species and habitat.
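
As a rough sketch of that fixed-effects view in base R: the data frame, species labels, family assignments, and counts below are all made up for illustration, and the unweighted mean and variance of species logits are only one possible choice of summary. A formal comparison between families would also need a contrast with its standard error, which is not shown here.

```r
# Hypothetical data: one row per species-by-habitat cell.
# All labels and counts are invented for illustration only.
birds <- data.frame(
  FAMILY  = rep(c("famA", "famA", "famB", "famB"), each = 2),
  SPECIES = rep(c("sp1", "sp2", "sp3", "sp4"), each = 2),
  HABITAT = rep(c("forest", "field"), times = 4),
  ticks   = c(12, 5, 20, 9, 3, 2, 8, 1),       # birds carrying ticks
  noticks = c(18, 25, 10, 21, 27, 18, 22, 29)  # birds without ticks
)

# Binomial model with species and habitat both as fixed effects.
fit <- glm(cbind(ticks, noticks) ~ HABITAT + SPECIES,
           family = binomial(link = "logit"), data = birds)

# Fitted logit for each species-by-habitat cell.
birds$logit <- predict(fit, type = "link")

# One logit per species, averaged over the two habitats.
sp <- aggregate(logit ~ FAMILY + SPECIES, data = birds, FUN = mean)

# Family-level summaries of the species logits:
# the mean, and the among-species variance within each family.
fam_mean <- tapply(sp$logit, sp$FAMILY, mean)
fam_var  <- tapply(sp$logit, sp$FAMILY, var)

print(fam_mean)
print(fam_var)
```

Here fam_var is just a descriptive statistic computed from the fitted values; nothing in the model itself treats species as a random draw from a family.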

I agree with the article you mentioned concerning the use of random coefficient models.  I just don't see treating species as a randomly selected subject from a family of species.  (Maybe treating insect species as a randomly selected species within a family where there are zillions of species but not for critters much higher up the food chain.)

Jim

-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Ben Bolker
Sent: Monday, February 25, 2013 7:27 AM
To: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Nested error term and unbalanced design

Baldwin, Jim -FS <jbaldwin at ...> writes:

>  While there is a definite order to family, genus, and species (no pun
> intended), I think that the "nestedness" (if any) would be related to
> how you selected your sampling units rather than the fixed effects of
> family, genus, and species.  (I admit bias in rarely if ever
> considering species as a random effect.)

> Jim

  I think I respectfully disagree ... see below ...

> I am trying to run a model that incorporates both environmental
> variables and taxonomic relationships, and I am unsure if I am 1)
> specifying the error term correctly, and 2) accounting for unbalanced
> data correctly. I would appreciate any guidance you can provide.

> As a simplified example, I want to ask if a bird is more likely to be
> carrying ticks based on the habitat it was caught in, and what kind of
> bird it is (my actual model has many more environmental variables). We
> have many related species in multiple genera in multiple families, but
> all in the same order. Species is nested within genus, and genus is
> nested within family. I want to estimate a fixed effect for both
> habitat and species, while accounting for the nestedness of the
> relationships of the birds, and I also want to account for the fact
> that we caught more of certain species than others.

> My simplified model looks like this:
>
> M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS/SPECIES),
> family=binomial(link="logit"))
>
> where y is a column vector of (tick presence, tick absence)
>
> So my questions are: is this the correct "grammar" for the nested error?
> and does the nested error structure by itself take into account the
> unbalanced data structure?

   Generally you don't have to worry about lack of balance in 'modern' mixed models unless it's really extreme.

  I'm having a little bit of a hard time conceptually with the idea of having species as a fixed effect _and_ having the variances of family and genus be random.  You certainly shouldn't have a categorical predictor (SPECIES) appear as both a random and a fixed effect, though.

M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS),
     family=binomial(link="logit"))

*might* work (I would give it a try and see if the results are sensible).
I would also consider

M1 <- lmer(y ~ HABITAT + (HABITAT|FAMILY/GENUS/SPECIES),
     family=binomial(link="logit"))

if your data set is big enough to support it.  This allows for habitat to have different effects on different species ... (see a paper by Schielzeth and Forstmeier on the importance of including interactions between fixed and random effects:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2657178/ )
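
For reference on the nesting "grammar": lme4 reads the shorthand (1|FAMILY/GENUS) as (1|FAMILY) + (1|FAMILY:GENUS), that is, an intercept for each family plus one for each genus-within-family combination. A small base-R sketch of what that combined grouping factor looks like, using made-up labels in which genus codes are deliberately reused across families (the data frame and the FAM_GEN column are hypothetical names):

```r
# Hypothetical labels, invented for illustration: genus codes "g1"/"g2"
# are reused in both families, so GENUS alone does not identify a genus.
taxa <- data.frame(
  FAMILY = c("famA", "famA", "famB", "famB"),
  GENUS  = c("g1",   "g2",   "g1",   "g2")
)

# The grouping factor behind a FAMILY:GENUS term: one level per
# genus-within-family combination.
taxa$FAM_GEN <- interaction(taxa$FAMILY, taxa$GENUS, drop = TRUE)

nlevels(factor(taxa$GENUS))  # 2 levels: the reused codes collapse
nlevels(taxa$FAM_GEN)        # 4 levels: every genus kept distinct
```

With real taxonomic names each genus belongs to a single family, so the labels are already unique and (1|FAMILY) + (1|GENUS) should define the same grouping as the nested shorthand.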

_______________________________________________
R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
