[R-sig-ME] Nested error term and unbalanced design
Ben Bolker
bbolker at gmail.com
Mon Feb 25 23:48:36 CET 2013
On 13-02-25 12:55 PM, Baldwin, Jim -FS wrote:
> I think someone wise said "When you find yourself in a hole, first
> put down the shovel." Someday I'll learn that. (Maybe today.) What
> follows is likely from my lack of biological (and maybe statistical)
> knowledge.
>
> The setup seems to be that individual birds (classified as to their
> species and habitat) are checked for the presence of ticks. For each
> species and habitat combination there is a proportion of birds with
> ticks. Each species is also classified as to genus and family. It
> is of interest to see if there are differences among genus and family
> classifications. I see everything as a fixed effect in this case.
>
> I see no random effects or a relevant variance component as I can't
> imagine that for any genus and family that there is actually a random
> sample from all species within that family (especially if there are
> only a small number of species within a particular family to select
> from).

I have a different definition of random effects, closer to the
pragmatic/Bayesian view than the philosophical/frequentist one (this is
discussed at more length at http://glmm.wikidot.com/faq ). In essence, I
base the distinction between fixed and random effects on two criteria:
* is it useful to estimate these parameters with shrinkage? (yes=random)
and
* would I rather have the ability to extrapolate to unmeasured
units/make inferences about the variation among units (random) or to
make inferential statements about differences between particular sets of
units (fixed)?
I do *not* make much use of the experimental-design criterion (were
these units selected randomly, or could they have been selected
randomly, from a larger set of values?).
So I see no problem in treating family/genus/species as random
effects. Opinions differ, though.
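
For concreteness, here is a minimal sketch of what treating all three
taxonomic levels as random intercepts could look like. Everything in it
is an assumption for illustration: the data are simulated, and the names
birds, n_birds and n_ticks are made up, not taken from the original
poster's data. It uses glmer(), which is how current versions of lme4
fit GLMMs (rather than lmer() with a family argument), and it is guarded
so that it does nothing if lme4 is not installed.

```r
# Sketch only: simulated data with hypothetical column names.
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(1)
  # 4 families x 3 genera x 3 species, each species caught in 2 habitats.
  # Genus/species labels are deliberately reused across families, which is
  # the situation where the nested (a/b/c) syntax matters.
  birds <- expand.grid(HABITAT = c("field", "forest"),
                       SPECIES = paste0("s", 1:3),
                       GENUS   = paste0("g", 1:3),
                       FAMILY  = paste0("f", 1:4))

  # Integer ids for each family, genus-within-family, species-within-genus
  fam_id <- as.integer(birds$FAMILY)
  gen_id <- as.integer(interaction(birds$FAMILY, birds$GENUS, drop = TRUE))
  sp_id  <- as.integer(interaction(birds$FAMILY, birds$GENUS, birds$SPECIES,
                                   drop = TRUE))

  # Linear predictor on the logit scale: habitat effect plus one random
  # intercept per family, genus and species (SDs chosen arbitrarily)
  eta <- -1 + 0.5 * (birds$HABITAT == "forest") +
    rnorm(4,  sd = 0.8)[fam_id] +
    rnorm(12, sd = 0.5)[gen_id] +
    rnorm(36, sd = 0.5)[sp_id]

  # Unequal numbers of birds caught per species/habitat (unbalanced data)
  birds$n_birds <- rpois(nrow(birds), 20) + 5
  birds$n_ticks <- rbinom(nrow(birds), birds$n_birds, plogis(eta))

  # Habitat as a fixed effect; family/genus/species as nested random intercepts
  fit <- lme4::glmer(cbind(n_ticks, n_birds - n_ticks) ~ HABITAT +
                       (1 | FAMILY/GENUS/SPECIES),
                     data = birds, family = binomial(link = "logit"))

  print(lme4::VarCorr(fit))  # one variance estimate per taxonomic level
  print(lme4::fixef(fit))    # intercept and habitat effect, logit scale
}
```

With a small simulated data set like this, some variance estimates may
come out at or near zero; that says something about the data, not that
the syntax is wrong.
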
> If a family (either within a habitat type or across habitat types) is
> to be compared to another family, it would seem that the first
> comparison would be among the mean of the species proportions (or
> maybe the mean of the logits or probits) for each family.
>
> Next it is conceivable that one might want to know if the variability
> of the species within a family varies among families. That could be
> done by defining/declaring the summary statistic of interest to be
> the variance of the "true" proportions within a family and one would
> use the sample data to estimate those variances. But these variances
> would be as summary statistics rather than a variance component
> essential to the definition of the model. The underlying model would
> simply be the number of birds with ticks following a binomial
> distribution with the proportion of birds with ticks being a function
> of species and habitat.
This is a sensible question, but hard to set up within lme4. The
random effects coded in lme4 (and in most GLMMs) quantify whether the
mean (on the link scale = logit/probit/etc.) differs among units, not
whether the variation differs. You could do this in AD Model
Builder/WinBUGS/Stan/etc. (I think this has been discussed before on
the list.)
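
As a rough illustration of the descriptive, summary-statistic version of
this idea (not a model with family-specific variance components), one
could compute the among-species variance of empirical logits within each
family in base R. The data frame and its column names below are
hypothetical, and the counts are invented for the example. Note that
this raw variance also includes binomial sampling noise, so it will tend
to overstate the variation in the "true" proportions, especially for
species with few birds.

```r
# Hypothetical counts per species (pooled over habitats); values are invented.
ticks <- data.frame(
  FAMILY  = c("f1", "f1", "f1", "f2", "f2", "f2", "f2"),
  SPECIES = c("a", "b", "c", "d", "e", "f", "g"),
  n_birds = c(30, 12, 45, 20, 8, 25, 60),
  n_ticks = c(6, 5, 9, 2, 1, 12, 30)
)

# Empirical logit with a 0.5 correction so counts of 0 or n stay finite
ticks$elogit <- with(ticks,
                     log((n_ticks + 0.5) / (n_birds - n_ticks + 0.5)))

# Sample variance of the species-level logits within each family
fam_var <- tapply(ticks$elogit, ticks$FAMILY, var)
print(fam_var)
```
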
>
> I agree with the article you mentioned concerning the use of random
> coefficient models. I just don't see treating species as a randomly
> selected subject from a family of species. (Maybe treating insect
> species as a randomly selected species within a family where there
> are zillions of species but not for critters much higher up the food
> chain.)
>
> Jim
>
> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org
> [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Ben Bolker
> Sent: Monday, February 25, 2013 7:27 AM
> To: r-sig-mixed-models at r-project.org
> Subject: Re: [R-sig-ME] Nested error term and unbalanced design
>
> Baldwin, Jim -FS <jbaldwin at ...> writes:
>
>> While there is a definite order to family, genus, and species (no
>> pun intended), I think that the "nestedness" (if any) would be
>> related to how you selected your sampling units rather than the
>> fixed effects of family, genus, and species. (I admit bias in
>> rarely if ever considering species as a random effect.)
>
>> Jim
>
> I think I respectfully disagree ... see below ...
>
>> I am trying to run a model that incorporates both environmental
>> variables and taxonomic relationships, and I am unsure if I am 1)
>> specifying the error term correctly, and 2) accounting for
>> unbalanced data correctly. I would appreciate any guidance you can
>> provide.
>
>> As a simplified example, I want to ask if a bird is more likely to
>> be carrying ticks based on the habitat it was caught in, and what
>> kind of bird it is (my actual model has many more environmental
>> variables). We have many related species in multiple genera in
>> multiple families, but all in the same order. Species is nested
>> within genus, and genus is nested within family. I want to estimate
>> a fixed effect for both habitat and species, while accounting for
>> the nestedness of the relationships of the birds, and I also want
>> to account for the fact that we caught more of certain species than
>> others.
>
>> My simplified model looks like this:
>>
>> M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS/SPECIES),
>> family=binomial(link="logit"))
>>
>> where y is a column vector of (tick presence, tick absence)
>>
>> So my questions are: is this the correct "grammar" for the nested
>> error? and does the nested error structure by itself take into
>> account the unbalanced data structure?
>
> Generally you don't have to worry about lack of balance in 'modern'
> mixed models unless it's really extreme.
>
> I'm having a little bit of a hard time conceptually with the idea of
> having species as a fixed effect _and_ having the variances of family
> and genus be random. You certainly shouldn't have a categorical
> predictor (SPECIES) appear as both a random and a fixed effect,
> though.
>
> M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS),
> family=binomial(link="logit"))
>
> *might* work (I would give it a try and see if the results are
> sensible). I would also consider
>
> M1 <- lmer(y ~ HABITAT + (HABITAT|FAMILY/GENUS/SPECIES),
> family=binomial(link="logit"))
>
> if your data set is big enough to support it. This allows for
> habitat to have different effects on different species ... (see a
> paper by Schielzeth and Forstmeier on the importance of including
> interactions between fixed and random effects:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2657178/ )
>