[R-sig-ME] Binomial GLMM vs GLM question

Sat May 17 21:45:06 CEST 2008

Hi all,
Thank you for all for you informative and helpful comments about my 
dilemma. I've pasted below a summary table of egg survival at the two 
ponds during the three years and the GLM output for a model looking for 
interactions between pond, rainfall and year. Below that is the output 
of a LMER model Ben suggested. Also, as Ben wondered earlier, the data 
are highly overdispersed. To give a little background, this wasn't an 
experiment but an observational study of what was happening in nature. 
The two ponds were selected because they are near my field site, no 
other reason. They differ in the amount of canopy shade overhead, which 
in turn is clearly going to affect egg desiccation (the primary source 
of mortality). Rainfall did vary during the three years (it was much 
heavier in 2005 than in 2003 or 2004). As a result, mortality was higher 
at Bridge Pond (little shade) than at Ocelot Pond, and it varied between 
years. These points are evident from the GLM anova table. All three 
factors, significantly affect egg mortality, pond being the most 
important. Rainfall varied across the three years, and there is a 3-way 
interaction which indicates to me simply that the effects of habitat 
(pond) and rainfall are not consistent across time. The model estimates 
make sense (e.g. rainfall has a negative slope, the estimate for Ocelot 
Pond is highly negative, etc.) In other words, the GLM output does make 
biological sense given what we know about the data and lends itself to 
fairly easy interpretation. Regarding the LMER output, it seems the 
variance of both rain and pond is obscenely huge. Is that a result of 
only have 2 ponds and 3 years? The estimate for the rainfall effect is 
negative, like in the GLM, which makes sense as well (as rain increase, 
mortality decreases). Any opinions on where to go from here? Again, 
thank you all for you helpful advice. I am very appreciative.
-Justin


Egg Survival
Year Bridge Ocelot Total
2003 0.069 0.453 0.251
2004 0.115 0.429 0.272
2005 0.270 0.505 0.376
Total 0.141 0.460 0.292


 > GLM.2 <- glm(tbl_mort ~ Pond *total_rainfall *Year , 
family=quasibinomial(logit), data=FieldData0305)

 > summary(GLM.2)

Call:
glm(formula = tbl_mort ~ Pond * total_rainfall * Year, family = 
quasibinomial(logit),
data = FieldData0305)

Deviance Residuals:
Min 1Q Median 3Q Max
-17.965 -4.838 1.062 5.197 15.225

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.68318 0.47947 5.596 4.53e-08 ***
Pond[T.Ocelot] -2.30872 0.53614 -4.306 2.18e-05 ***
total_rainfall -0.11465 0.02333 -4.915 1.39e-06 ***
Year[T.2004] -0.97218 0.63583 -1.529 0.127198
Year[T.2005] -3.92557 0.76808 -5.111 5.38e-07 ***
Pond[T.Ocelot]:total_rainfall 0.09799 0.02609 3.756 0.000204 ***
Pond[T.Ocelot]:Year[T.2004] 0.48230 0.77189 0.625 0.532506
Pond[T.Ocelot]:Year[T.2005] 2.15914 1.22853 1.757 0.079739 .
total_rainfall:Year[T.2004] 0.08475 0.02717 3.120 0.001967 **
total_rainfall:Year[T.2005] 0.12082 0.02440 4.952 1.16e-06 ***
Pond[T.Ocelot]:total_rainfall:Year[T.2004] -0.05511 0.03585 -1.537 0.125162
Pond[T.Ocelot]:total_rainfall:Year[T.2005] -0.11371 0.02948 -3.857 
0.000137 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 38.23861)

Null deviance: 20762 on 349 degrees of freedom
Residual deviance: 15252 on 338 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

 > Anova(GLM.2, test="F")
Anova Table (Type II tests)

Response: tbl_mort
SS Df F Pr(>F)
Pond 1002.7 1 26.2229 5.119e-07 ***
total_rainfall 238.2 1 6.2306 0.0130318 *
Year 496.3 2 6.4894 0.0017157 **
Pond:total_rainfall 101.7 1 2.6599 0.1038378
Pond:Year 172.7 2 2.2587 0.1060579
total_rainfall:Year 650.3 2 8.5037 0.0002493 ***
Pond:total_rainfall:Year 704.8 2 9.2159 0.0001267 ***
Residuals 12924.5 338
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


 > m1 <- lmer(tbl_mort~total_rainfall + (1|Year)+(1|Pond), family = 
quasibinomial, data=FieldData0305)

 > summary(m1)
Generalized linear mixed model fit using Laplace
Formula: tbl_mort ~ total_rainfall + (1 | Year) + (1 | Pond)
Data: FieldData0305
Family: quasibinomial(logit link)
AIC BIC logLik deviance
14657 14673 -7325 14649
Random effects:
Groups Name Variance Std.Dev.
Year (Intercept) 2.0141e+15 44878595
Pond (Intercept) 1.3427e+15 36643219
Residual 8.8116e+16 296844003
number of obs: 350, groups: Year, 3; Pond, 2

Fixed effects:
Estimate Std. Error t value
(Intercept) 7.486e-01 5.736e+13 1.305e-14
total_rainfall -2.316e-02 4.406e+11 -5.257e-14

Correlation of Fixed Effects:
(Intr)
total_rnfll -0.956


> On 17/05/2008, at 6:53 AM, Ben Bolker wrote:
>
> <snip>
> >/
> />/ | I think this is a good example of what seems a common problem for
> />/ | people using mixed models - How to decide what are the random  
> />/ factors
> />/ | and what are the fixed factors.
> />/
> />/ ~  I agree that this is difficult, and contentious (see e.g.
> />/ Andrew Gelman, "Analysis of variance--why it is more important than
> />/ ever," Annals of Statistics 33, no. 1 (2005): 1-53,
> />/ doi:http://dx.doi.org/doi:10.1214/009053604000001048)
> />/
> /Available at
> http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf <http://www.stat.columbia.edu/%7Egelman/research/published/AOS259.pdf>
>
> >/
> />/ | In your study you have taken repeated measurements on the same  
> />/ ponds,
> />/ | as such, observations from each pond across years are not  
> />/ independent
> />/ | (observations in pond 1 at t+1 will be correlated with observations
> />/ | at t).  If you were only interested in the effect of rainfall on egg
> />/ | survival I would suggest you use a mixed model with year and pond as
> />/ | random effects (based on a similar eg in Crawley's R Book p605). In
> />/ | this case I assume you have measurements on every pond in every year
> />/ | and as such these random effects are crossed (not nested).
> />/
> />/ ~  (a) I would suggest that a mixed model is NOT going to work well
> />/ in this case, even if year and pond are "philosophically" random  
> />/ effects
> />/ (i.e., you don't really care about what happens in those specific
> />/ ponds, and you may even have chosen them with a random-number  
> />/ generator
> />/ out of a list of all possible ponds -- although this is much less
> />/ likely with years ...).  The technical problem is that estimating
> />/ variances from 2 or 3 points is nasty.  This translates into
> />/ inference/philosophical terms because these few points really
> />/ don't give you the data to generalize about the population, even if
> />/ you want to.  (I think one of the confusions is that in the classical
> />/ method-of-moments world there's nothing that says you can't have 2
> />/ denominator degrees of freedom -- your power will be terrible, but
> />/ the expressions won't blow up on you [unless you get negative variance
> />/ estimates ...])
> />/ |
> />/ | m1 <- lmer(mort~rainfall + (year|pond), family = binomial, data=
> />/ | FieldData0305)
> />/
> />/ ~   (1|year)+(1|pond) might work OK, and be slightly more
> />/ parsimonious (OR ponds nested within years, (1|year)+(1|pond:year), or
> />/ vice versa -- and if you're not dead set on treating the random  
> />/ effects
> />/ as "effects of year" and "effects of pond", but were willing to  
> />/ treat it
> />/ as "effects of pond within year" that would buy you the flexibility
> />/ to use some other package that wasn't so good at crossed
> />/ effects.)
> />/
> />/
> />/ | summary(m1)
> />/ |  From the summary output you can assess the influence of the fixed
> />/ | effect to see if the estimate for rainfall (slope estimate) is
> />/ | different to zero.
> />/ |
> />/ | Even if the number of groups (years and ponds) is small it is still
> />/ | better to take account of this group variation than ignore it (see
> />/ | Gelman and Hill 200&.
> />/
> />/ ~  Yes, but ... what if all you can get out of the model is that
> />/ the estimate is nearly zero?  If you go Bayesian instead you can
> />/ deal with some problems by setting an informative prior [in theory
> />/ setting a proper prior, no matter how weak, would generally solve
> />/ the problem, but in reality I suspect that if the model is
> />/ overparameterized you're going to have nasty technical difficulties
> />/ even if the model is theoretically OK])
> />/ ~  With respect to Andrew Gelman (I can't believe I'm saying this --
> />/ sacrilege) but I think he's used to really big social-science data  
> />/ sets,
> />/ where "leave stuff in when you're not sure" is more generally
> />/ a good idea than it is in typical field ecology data sets, where
> />/ the bias-variance tradeoff bites harder.  (At least there are 350
> />/ data points here,
> />/
> />/ |  From this model you can also get a feeling for the between year and
> />/ | between pond variation by looking at the random effects estimates  
> />/ and
> />/ | variances. Obviously you will have some idea if you just plot the
> />/ | means for each year and each pond.
> />/
> />/ ~  Only if the variance estimates don't suck.
> />/
> />/ | If you are also interested in understanding how egg survival differs
> />/ | between ponds or between years and if this interacts with rainfall
> />/ | then it become less straight forward.
> />/ |
> />/ | I would suggest that you try
> />/ |
> />/ | m2 <- lmer(mort~rainfall * year * pond + (year|pond), family =
> />/ | binomial, data= FieldData0305)
> />/ | summary(m2)
> />/ |
> />/ | In the summary you will have estimates for all three factors and
> />/ | their interactions and you can ascertain if these are good
> />/ | explanatory variables for egg survival.
> />/
> />/ ~   Isn't this overparameterized?  We have a fixed effect for
> />/ each year:pond combination (and variation in the slopes of
> />/ the effect with respect to rainfall), as well as a random-effect
> />/ level for years and ponds?
> />/
> /
> This does bring up the important point as to whether the effect of  
> rainfall does vary within year and site. One advantage of the mixed  
> effects method for dealing with varying slopes is it gives a nice  
> population estimate. Unfortunately this is a lot of random effects for  
> not many groups.  It is looking very messy.
>
> I think with this sort of data, as in it has very few groups) the best  
> option is to start with the basics (probably always the best option).  
> Fit a separate model for each group and determine an appropriate  
> transform for rainfall. Do the models fit reasonably? Look at the  
> parameter estimates with 95% CI (plotting is a good idea) and see how  
> much variation there is, and there is nothing strange happening like  
> one unusual year-pond. Then decide whether the more complex models  
> will tell you anything more.
>
> Ken
> >

-- 
Justin Touchon
Doctoral Student
Boston University
Ecology, Behavior and Evolution
http://people.bu.edu/jtouchon