[R-sig-ME] GLMM for data with many zeros: poisson, lognormal-poisson or zero-inflated?
Alicia Valdés
aliciavaldes1501 at gmail.com
Thu May 5 11:47:09 CEST 2011
Dear members of the list,
This is my first message and as I am very new to GLMMs (and my
statistical knowledge is far from broad) I hope to formulate my
questions adequately. I have spent some days reading through the list
archives to find a solution to my problem, but I am still not sure if
I am doing right. I am analyzing data on number of seedlings emerged
in experimental plots for a perennial plant (thus, my response
variable are counts). I have performed my experiment on 6 different
sites (random factor). Each site has two areas: one with presence and
one with absence of populations of this particular species (so I have
a factor called "species", which is binomial: 0/1). Into each of these
areas I have 4 replies. Two of these replies are under closed forest
canopy and two in open areas (I called this fixed factor "cover" and
it has two levels: "high" and "low"). Finally, each of the replies
consists of 3 different treatments applied to sowed seeds (fixed
factor with three levels: A, E, C). I have made periodic revisions of
this experiment, so I have 7 dates of revision till now. I am trying
to fit a GLMM with lmer ni package lme4, in order to see the effect of
all these factors and their interactions in seedling emergence. I am
thinking of fitting a different GLMM for each revision (1-7) to see if
there are changes in significant factors.
Here is how part of my data (those for rev 1, site 1) look:
rev site species repl cover treat seedlings
1 1 0 1 low A
3
1 1 0 1 low C
0
1 1 0 1 low E
0
1 1 0 2 low A
5
1 1 0 2 low C
0
1 1 0 2 low E
3
1 1 0 3 high A
1
1 1 0 3 high C
0
1 1 0 3 high E
0
1 1 0 4 high A
0
1 1 0 4 high C
0
1 1 0 4 high E
0
1 1 1 1 low A
2
1 1 1 1 low C
0
1 1 1 1 low E
4
1 1 1 2 low A
0
1 1 1 2 low C
0
1 1 1 2 low E
1
1 1 1 3 high A
0
1 1 1 3 high C
0
1 1 1 3 high E
0
1 1 1 4 high A
2
1 1 1 4 high C
0
1 1 1 4 high E
0
...
My first problem is that 66% of this count data (taking all the
revisions together) are zeros (and into each revision the percentage
of zeros is also high). Should I then fit a zero-inflated model? This
is not available in lmer, and I investigated zeroinfl in the package
pscl but it does not allow to include random effects (and, to my
knowledge, I should include "site" as a random effect). Poisson family
is strongly suggested for count data, but using goodfit in the vcd
package I found that my data are significantly different from the
Poisson distribution. So which family should I use? I have read about
the problem with quasi distributions and why they are not any more
available in lmer. I have also read some posts talking about
overdispersion and translating the model to a lognormal-Poisson model
by using an individual-level random variable and using family=poisson.
I have tried this with my data and it worked, but I am really not sure
if I have overdispersion or not, as I do not know which parameter do I
have to look at in order to quantify overdispersion.
Here is my poisson model for the first revision (I made subsets of the
data, and data1 corresponds to the first revision):
model1<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+
(1+species+treat|site),family=poisson,data1)
Then I created a new variable with a unique value for each observation
and included this as an additional random-effect term
data1$over <- 1:nrow(data1)
And this is the lognormal-Poisson model:
model2<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+(1+species+treat|site)+(1|over),data
= data1,family = poisson)
This is the summary for both models:
> summary(model1)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species + treat | site)
Data: data1
AIC BIC logLik deviance
407.3 463.1 -184.7 369.3
Random effects:
Groups Name Variance Std.Dev. Corr
site (Intercept) 0 0
species1 0 0 NaN
treatC 0 0 NaN NaN
treatE 0 0 NaN NaN NaN
Number of obs: 139, groups: site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.190e-01 2.374e-01 0.922 0.3563
species1 -3.586e-01 2.830e-01 -1.267 0.2050
coverlow 1.113e-01 2.803e-01 0.397 0.6913
treatC -3.493e+01 2.432e+03 -0.014 0.9885
treatE -1.136e-03 3.175e-01 -0.004 0.9971
coverlow:treatC 1.761e+01 1.803e+03 0.010 0.9922
coverlow:treatE 7.783e-01 3.534e-01 2.202 0.0277 *
species1:treatC 1.768e+01 1.632e+03 0.011 0.9914
species1:treatE 4.277e-01 3.451e-01 1.240 0.2152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
species1 -0.539
coverlow -0.623 0.041
treatC 0.000 0.000 0.000
treatE -0.748 0.403 0.466 0.000
covrlw:trtC 0.000 0.000 0.000 -0.741 0.000
covrlw:trtE 0.494 -0.033 -0.793 0.000 -0.669 0.000
species1:treatC 0.000 0.000 0.000 -0.671 0.000 0.000 0.000
species1:treatE 0.442 -0.820 -0.034 0.000 -0.529 0.000 0.047 0.000
> summary(model2)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species + treat | site) + (1 | over)
Data: data1
AIC BIC logLik deviance
238.9 297.6 -99.47 198.9
Random effects:
Groups Name Variance Std.Dev. Corr
over (Intercept) 2.5371 1.5928
site (Intercept) 0.0000 0.0000
species1 0.0000 0.0000 NaN
treatC 0.0000 0.0000 NaN NaN
treatE 0.0000 0.0000 NaN NaN NaN
Number of obs: 139, groups: over, 139; site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8544 0.5720 -1.494 0.135
species1 -0.8135 0.6783 -1.199 0.230
coverlow 0.3388 0.6719 0.504 0.614
treatC -35.3404 6943.1491 -0.005 0.996
treatE 0.2528 0.7643 0.331 0.741
coverlow:treatC 17.1976 5840.7446 0.003 0.998
coverlow:treatE 0.8608 0.8823 0.976 0.329
species1:treatC 17.2652 3754.0680 0.005 0.996
species1:treatE 0.4864 0.8843 0.550 0.582
Correlation of Fixed Effects:
(Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
species1 -0.530
coverlow -0.631 0.035
treatC 0.000 0.000 0.000
treatE -0.748 0.396 0.472 0.000
covrlw:trtC 0.000 0.000 0.000 -0.841 0.000
covrlw:trtE 0.481 -0.027 -0.762 0.000 -0.635 0.000
species1:treatC 0.000 0.000 0.000 -0.541 0.000 0.000 0.000
species1:treatE 0.406 -0.767 -0.027 0.000 -0.528 0.000 0.020 0.000
I don't know why I get this 0 and NaN for the random effects. I tried
to look at the interaction of site with species an treatment, apart
from the single effect of site. Perhaps my formula specification is
not correct?
Besides, I performed an anova with the two models:
> anova(model1,model2)
Data: data1
Models:
model1: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model1: treat | site)
model2: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model2: treat | site) + (1 | over)
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
model1 19 407.30 463.06 -184.651
model2 20 238.93 297.62 -99.465 170.37 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So looking at this I believe I should choose model 2, but I still
don’t see how can I know if there is an important overdispersion in
model 1. And do you think this model 2 is acceptable having too many
zeros in my data? Or should I try to fit any other kind of model? I
believe that keeping site as a fixed effect and using zeroinfl in pscl
package would be not correct at all… do you agree?
And my last question is: do you think it is ok to fit a different
model for each revision, or could I try some kind of repeated-measures
approach? I don’t know if this is available for GLMMs, and it goes far
beyond my statistical knowledge, but perhaps there is some possible
way.
Thanks a lot for your attention, and sorry for the long message. Any
kind of help or hint would be very welcome!
Regards,
Alicia Valdés Rapado
PhD Student
Dpto. Biología Organismos y Sistemas - Unidad de Ecología
Universidad de Oviedo
España - Spain
e-mail: valdesalicia.uo at uniovi.es
More information about the R-sig-mixed-models
mailing list