[R-sig-ME] GLMM for data with many zeros: poisson, lognormal-poisson or zero-inflated?
Ben Bolker
bbolker at gmail.com
Fri May 6 18:31:57 CEST 2011
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 11-05-05 05:47 AM, Alicia Valdés wrote:
> Dear members of the list,
>
> This is my first message and as I am very new to GLMMs (and my
> statistical knowledge is far from broad) I hope to formulate my
> questions adequately. I have spent some days reading through the list
> archives to find a solution to my problem, but I am still not sure if
> I am doing right. I am analyzing data on number of seedlings emerged
> in experimental plots for a perennial plant (thus, my response
> variable are counts). I have performed my experiment on 6 different
> sites (random factor). Each site has two areas: one with presence and
> one with absence of populations of this particular species (so I have
> a factor called "species", which is binomial: 0/1).
Normally one would speak of *predictor* variables (which this is) as
"binary" rather than "binomial"
> Into each of these
> areas I have 4 replies. Two of these replies are under closed forest
> canopy and two in open areas (I called this fixed factor "cover" and
> it has two levels: "high" and "low"). Finally, each of the replies
> consists of 3 different treatments applied to sowed seeds (fixed
> factor with three levels: A, E, C). I have made periodic revisions of
> this experiment, so I have 7 dates of revision till now. I am trying
> to fit a GLMM with lmer ni package lme4, in order to see the effect of
> all these factors and their interactions in seedling emergence. I am
> thinking of fitting a different GLMM for each revision (1-7) to see if
> there are changes in significant factors.
Fitting a different GLMM for each one is simpler, but if you actually
want to make any inferences about the changes between revisions (don't
quite know what this means) you should do a single combined analysis and
include interactions with 'revision' in order to test differences in the
effects of factors across revisions: e.g. cover:revision (interaction
between cover and revision) would parameterize the variation in effect
of cover between revisions.
6 sites is a fairly small number for estimating a random effect, but
not impossible -- you may discover, depending on the amount of noise in
your data set, that the estimated across-site variance is zero. In
principle you can also look at the variation across sites in the effects
of factors (e.g. (1|site:cover) or (cover|site)), but you may run out of
data if you try to do too much of this ...
>
> Here is how part of my data (those for rev 1, site 1) look:
>
> rev site species repl cover treat seedlings
>
> 1 1 0 1 low A
> 3
>
> 1 1 0 1 low C
> 0
>
> 1 1 0 1 low E
> 0
>
> 1 1 0 2 low A
> 5
>
> 1 1 0 2 low C
> 0
>
> 1 1 0 2 low E
> 3
>
> 1 1 0 3 high A
> 1
>
> 1 1 0 3 high C
> 0
>
> 1 1 0 3 high E
> 0
>
> 1 1 0 4 high A
> 0
>
> 1 1 0 4 high C
>
> ...
>
> My first problem is that 66% of this count data (taking all the
> revisions together) are zeros (and into each revision the percentage
> of zeros is also high). Should I then fit a zero-inflated model? This
> is not available in lmer, and I investigated zeroinfl in the package
> pscl but it does not allow to include random effects (and, to my
> knowledge, I should include "site" as a random effect).
Maybe, but not necessarily. If the means are low enough then
zero-inflation may not be necessary. glmmADMB will fit zero-inflated
Poisson (or negative binomial) models with a single random effect (the
new version, coming soon, will allow multiple random effects).
> Poisson family
> is strongly suggested for count data, but using goodfit in the vcd
> package I found that my data are significantly different from the
> Poisson distribution. So which family should I use? I have read about
> the problem with quasi distributions and why they are not any more
> available in lmer. I have also read some posts talking about
> overdispersion and translating the model to a lognormal-Poisson model
> by using an individual-level random variable and using family=poisson.
> I have tried this with my data and it worked, but I am really not sure
> if I have overdispersion or not, as I do not know which parameter do I
> have to look at in order to quantify overdispersion.
In general you want to look at the residual deviance or sum of
squared Pearson residuals, although this is only approximate. Checking
whether the overall (marginal) distribution is Poisson doesn't mean very
much in general.
>
> Here is my poisson model for the first revision (I made subsets of the
> data, and data1 corresponds to the first revision):
>
> model1<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+
> (1+species+treat|site),family=poisson,data1)
My guess is that this is a bit too complicated in the random effects.
I would see first if this works:
model1<-lmer(seedlings~treat*(species+cover)+(1|site),
family=poisson,data1)
then try to work up from there, seeing which combinations of (1)
zero-inflation (2) overdispersion (via negative binomial in glmmADMB or
via lognormal-Poisson in lme4) (3) more complex random effects your data
can accommodate. Basically, you want to specify the most complex
*reasonable* model that your data can handle, then (if you are doing
hypothesis tests) drop selected terms from the full model (one at a
time) and evaluate the difference in likelihood/deviance.
I'm not clear what 'revisions' are, and whether you want to treat them
as random effects (i.e., experimental blocks), or whether you have
specific interests/hypotheses about how the effects changed between
revisions ...
>
> Then I created a new variable with a unique value for each observation
> and included this as an additional random-effect term
>
> data1$over <- 1:nrow(data1)
>
> And this is the lognormal-Poisson model:
>
> model2<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+(1+species+treat|site)+(1|over),data
> = data1,family = poisson)
>
> This is the summary for both models:
>
>> summary(model1)
> Generalized linear mixed model fit by the Laplace approximation
> Formula: seedlings ~ species + cover + treat + cover:treat +
> species:treat + (1 + species + treat | site)
> Data: data1
> AIC BIC logLik deviance
> 407.3 463.1 -184.7 369.3
> Random effects:
> Groups Name Variance Std.Dev. Corr
> site (Intercept) 0 0
> species1 0 0 NaN
> treatC 0 0 NaN NaN
> treatE 0 0 NaN NaN NaN
> Number of obs: 139, groups: site, 6
>
> Fixed effects:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 2.190e-01 2.374e-01 0.922 0.3563
> species1 -3.586e-01 2.830e-01 -1.267 0.2050
> coverlow 1.113e-01 2.803e-01 0.397 0.6913
> treatC -3.493e+01 2.432e+03 -0.014 0.9885
> treatE -1.136e-03 3.175e-01 -0.004 0.9971
> coverlow:treatC 1.761e+01 1.803e+03 0.010 0.9922
> coverlow:treatE 7.783e-01 3.534e-01 2.202 0.0277 *
> species1:treatC 1.768e+01 1.632e+03 0.011 0.9914
> species1:treatE 4.277e-01 3.451e-01 1.240 0.2152
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Correlation of Fixed Effects:
> (Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
> species1 -0.539
> coverlow -0.623 0.041
> treatC 0.000 0.000 0.000
> treatE -0.748 0.403 0.466 0.000
> covrlw:trtC 0.000 0.000 0.000 -0.741 0.000
> covrlw:trtE 0.494 -0.033 -0.793 0.000 -0.669 0.000
> species1:treatC 0.000 0.000 0.000 -0.671 0.000 0.000 0.000
> species1:treatE 0.442 -0.820 -0.034 0.000 -0.529 0.000 0.047 0.000
>> summary(model2)
> Generalized linear mixed model fit by the Laplace approximation
> Formula: seedlings ~ species + cover + treat + cover:treat +
> species:treat + (1 + species + treat | site) + (1 | over)
> Data: data1
> AIC BIC logLik deviance
> 238.9 297.6 -99.47 198.9
> Random effects:
> Groups Name Variance Std.Dev. Corr
> over (Intercept) 2.5371 1.5928
> site (Intercept) 0.0000 0.0000
> species1 0.0000 0.0000 NaN
> treatC 0.0000 0.0000 NaN NaN
> treatE 0.0000 0.0000 NaN NaN NaN
> Number of obs: 139, groups: over, 139; site, 6
>
> Fixed effects:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -0.8544 0.5720 -1.494 0.135
> species1 -0.8135 0.6783 -1.199 0.230
> coverlow 0.3388 0.6719 0.504 0.614
> treatC -35.3404 6943.1491 -0.005 0.996
> treatE 0.2528 0.7643 0.331 0.741
> coverlow:treatC 17.1976 5840.7446 0.003 0.998
> coverlow:treatE 0.8608 0.8823 0.976 0.329
> species1:treatC 17.2652 3754.0680 0.005 0.996
> species1:treatE 0.4864 0.8843 0.550 0.582
>
> Correlation of Fixed Effects:
> (Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
> species1 -0.530
> coverlow -0.631 0.035
> treatC 0.000 0.000 0.000
> treatE -0.748 0.396 0.472 0.000
> covrlw:trtC 0.000 0.000 0.000 -0.841 0.000
> covrlw:trtE 0.481 -0.027 -0.762 0.000 -0.635 0.000
> species1:treatC 0.000 0.000 0.000 -0.541 0.000 0.000 0.000
> species1:treatE 0.406 -0.767 -0.027 0.000 -0.528 0.000 0.020 0.000
>
> I don't know why I get this 0 and NaN for the random effects. I tried
> to look at the interaction of site with species an treatment, apart
> from the single effect of site. Perhaps my formula specification is
> not correct?
>
> Besides, I performed an anova with the two models:
>
>> anova(model1,model2)
> Data: data1
> Models:
> model1: seedlings ~ species + cover + treat + cover:treat +
> species:treat + (1 + species +
> model1: treat | site)
> model2: seedlings ~ species + cover + treat + cover:treat +
> species:treat + (1 + species +
> model2: treat | site) + (1 | over)
> Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
> model1 19 407.30 463.06 -184.651
> model2 20 238.93 297.62 -99.465 170.37 1 < 2.2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> So looking at this I believe I should choose model 2, but I still
> don’t see how can I know if there is an important overdispersion in
> model 1. And do you think this model 2 is acceptable having too many
> zeros in my data? Or should I try to fit any other kind of model? I
> believe that keeping site as a fixed effect and using zeroinfl in pscl
> package would be not correct at all… do you agree?
>
> And my last question is: do you think it is ok to fit a different
> model for each revision, or could I try some kind of repeated-measures
> approach? I don’t know if this is available for GLMMs, and it goes far
> beyond my statistical knowledge, but perhaps there is some possible
> way.
>
> Thanks a lot for your attention, and sorry for the long message. Any
> kind of help or hint would be very welcome!
>
> Regards,
>
> Alicia Valdés Rapado
> PhD Student
> Dpto. Biología Organismos y Sistemas - Unidad de Ecología
> Universidad de Oviedo
> España - Spain
> e-mail: valdesalicia.uo at uniovi.es
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk3EIn0ACgkQc5UpGjwzenNBlgCdGlY7lDGvE8/0k0uDpgpJiUYz
utEAnR4mUzF6tseDK/cmUqpsYxZzh2Dv
=kEMb
-----END PGP SIGNATURE-----
More information about the R-sig-mixed-models
mailing list