[R-sig-ME] Zero-inflated mixed effects model - clarification of zeros modeled and R package questions

Highland Statistics Ltd highstat at highstat.com
Thu Jun 21 01:32:36 CEST 2012






------------------------------

Message: 5
Date: Wed, 20 Jun 2012 15:32:38 -0700
From: Jennifer Barrett <jenn.s.barrett at gmail.com>
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] Zero-inflated mixed effects model - clarification
	of zeros modeled and R package questions
Message-ID:
	<CAEbqvwo7a+hOLPcZ1BKUerSir+Xwcczk3AY5CeFbc80woLeM3A at mail.gmail.com>
Content-Type: text/plain

Hi folks,


I’m looking for some guidance on zero-inflated models with
repeated measures (i.e., a random effect for site). My first question is more
of a statistical one, while the second is related to R packages. Apologies
for the long post; however, I want to make sure my concerns/questions are
clear!


Our project and dataset:


- The aim of our project is to 1) examine associations between shoreline
habitat characteristics and the abundance of several shorebird species; and
2) estimate the total abundance of each shorebird species within the entire
study region based on the models from 1) above, with confidence intervals.
Note that we will be using an information theoretic approach for 1) above,
and would like to use MMI for 2).

- Our response dataset consists of counts of shorebirds at >150 coastal
sites, conducted on the second Sunday of each month between the months of
Oct-March, over 10 years; however, not every site was surveyed in all
months (we’ve limited our dataset to those with a minimum of 3 counts in a
year).  Our response variable is thus the number of birds counted in a
given month/year at a given site. Note that we plan to model each year
separately.


AFZ: No spatial correlation between sites?


-  The habitat dataset consists of shoreline units within our entire study
region, with each unit characterized by exposure, substrate type...etc.
Using GIS, we’ve measured the length of shoreline belonging to shoreline
categories (e.g., sand, rock, mud) within each survey site, the average
exposure for the site, and other continuous attributes, as well as one
presence/absence covariate.

- Initial exploratory analysis has shown that the counts are zero-inflated.
While there may be some false zeros in our dataset (i.e., observer error),
the source of the zero-inflation is likely preference of shorebirds for
particular sites with particular features and avoidance of others (i.e.,
true zeros or “structural zeros”). Some zeros likely also arise because the
species does not saturate its habitat (i.e., habitat suitable, but
unoccupied – also a “true” zero), though again, the majority of the zeros
are likely structural.


Onto my questions:


1) I’ve been reading through the literature to decide what type of model
would best be suited for our dataset and questions. While all articles seem
to agree that the choice of a model needs to consider the source of excess
zeros, they seem to contradict one another with regard to which zeros are
being modeled in each component of a zero-inflated mixture model. Note that
I am not considering a two-part (i.e., conditional) model, because I do not
believe that all zeros arise from the occupancy process (as per Joseph et
al. 2009 and as noted above, zero abundance can occur by chance in our
system). Examples:


- Martin et al. (2005) state that when zero inflation is due to true zeros,
two-part or mixture models (ZIP or ZINB) are recommended, and that when
zero inflation is due to false zeros, a ZIB mixture model is recommended;
however, when zero inflation is due to both excess true and false zeros, a
Bayesian framework may be used, though there is no formal discussion in the
literature. NOTE: Since this article was published, Royle’s N-mixture model
has addressed this issue; however, I cannot use this approach as my data do
not meet the assumption of a closed population during the study period.

- In contrast to Martin et al. (2005), Potts and Elith (2006) state that
the zero-inflated mixture model structure implies that zero observations
arising from the zero process are true negative observations, and that
those arising from the Poisson process are false negative observations “that
is, the habitat is suitable, but unoccupied” (p.155). However, on the
previous page, they defined false negative as “attributable to experimental
design or observer error”, and habitat that is “suitable, but unoccupied”
as a true negative, so I'm not sure which type of zero observation they are
really referring to here for the Poisson process.

- In contrast to both sources above, Zuur et al. (2009) state that in a ZIP
or ZINB, zeros are modeled as coming from two processes – the binomial
process, which models only false zeros (observer, design, and survey error)
and the Poisson (or Negbin) process, which models the true zeros and
counts. This is the opposite of what was stated by Potts and Elith.



AFZ: These don't read as all that contradictory :-). The definition of true and false zeros
changes depending on the data set and the question. What counts as a true zero in one
setting could be a false zero in another, and vice versa.



- Finally, I’ve read other sources which state that ZIPs simply treat the
population as a mixture, with one set of subjects having a zero response –
in other words, there is no mention of whether the zero process is modeling
the “true” or “false” zeros.


AFZ: True. See the discussion in the Epilogue of our 2012 book:
Zero Inflated Models and Generalized Linear Mixed Models with R. (2012)
Zuur, Saveliev, Ieno.
http://www.highstat.com/book4.htm

It is fully discussed in there. You can indeed view the ZIP as a weighted
average of two distributions...and there is no need for an interpretation
in terms of true and false zeros. True/false zeros make it a nicer story though!
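
Written out, the weighted-average view looks roughly as follows. This is a sketch in notation assumed here rather than taken from the thread: pi_i is the weight of the always-zero component and mu_i is the Poisson mean for observation i.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\mu_i}, \\
P(Y_i = y) &= (1 - \pi_i)\, \frac{\mu_i^{y}\, e^{-\mu_i}}{y!}, \qquad y = 1, 2, \ldots
\end{aligned}
```

The zero probability has two terms, so zeros arise from both components whatever labels are attached to them.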





Thinking about my system: there are a bunch of sites where the birds (of a
given species) never go (habitat is unsuitable), and a bunch where they do
go with varying levels of abundance (habitat is suitable, but some sites
are more favored than others, based on habitat features). Following the
last bullet above, a site that is suitable may have a count of zero simply
because the species wasn’t present there on the survey day (i.e., true zero
occurring by chance). Given the contradicting information above, and the
consensus on the importance of considering the source of zeros in model
selection, I would very much appreciate if someone could clear this up for
me - or let me know if I'm completely missing something here? Perhaps this
question should be posed on a stats forum, but given question 2 below, I
thought I'd try here first.



AFZ: Sounds like a ZIP or ZINB to me. But you may also need to add a spatially
correlated error term.
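
For the system described above, a ZIP splits the zeros into those from the always-zero component (e.g., unsuitable habitat) and those the Poisson part produces by chance at suitable sites. A minimal Python sketch of that split follows; the values of `mu` and `pi` are made up for illustration, and the function names are hypothetical.

```python
import math


def zip_pmf(y, mu, pi):
    """P(Y = y) for a zero-inflated Poisson.

    mu: mean of the Poisson (count) component.
    pi: probability of the always-zero component.
    """
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # A zero can come from either component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zero_sources(mu, pi):
    """Split P(Y = 0) into the always-zero part and the chance (Poisson) part."""
    structural = pi
    chance = (1.0 - pi) * math.exp(-mu)
    return structural, chance


# Hypothetical values, chosen only for illustration.
mu, pi = 2.0, 0.4
structural, chance = zero_sources(mu, pi)
total_prob = sum(zip_pmf(y, mu, pi) for y in range(60))

print("P(zero from always-zero component):", structural)
print("P(zero by chance from Poisson part):", chance)
print("Sum of the pmf over 0..59:", total_prob)
```

With a small mean at suitable sites, the chance zeros can make up a sizeable share of all zeros, which is the case a two-part (hurdle) model would not represent.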




2) Assuming that I’m on the right track with a ZIP, is there a package I
can use to model a ZIP with a random effect for site? I looked at glmmADMB;
however, the zero inflation can only be modeled as a constant. This doesn’t
make sense for my system, as the zero-inflation will be a function of
habitat covariates (see above). Likewise, glmmPQL is not an option, as this
method does not yield log-likelihoods (and thus no AIC).


AFZ: Welcome to WinBUGS or OpenBUGS. Actually...you have to be careful. I've noticed
that random effects and correlation structures may fight with the zero inflation
components in the model, depending on where the zeros are. See again the reference above.


I’m also thinking that the random effect will have to be included in the
zero process as well – is this right?


AFZ: That is an option.....but it also makes the model more complicated.
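
To make the structure under discussion concrete, here is a small simulation sketch in Python (standard library only). It is not a fitting routine and not anyone's actual model: all coefficient values, the single habitat covariate `x`, and the function names are hypothetical. It assumes a logit link for the zero-inflation probability, a log link for the Poisson mean, a normal site-level random intercept in the count part, and an optional one in the zero part (`sd_zero`).

```python
import math
import random


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def rpois(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's method (adequate for modest mu)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_zip_sites(n_sites=150, n_visits=6, seed=1,
                       gamma=(-0.5, 1.5),  # zero part: intercept, slope (hypothetical)
                       beta=(1.0, 0.8),    # count part: intercept, slope (hypothetical)
                       sd_count=0.5,       # SD of site effect in the count part
                       sd_zero=0.0):       # SD of site effect in the zero part
    """Return (site, x, y) rows simulated from a ZIP with site random intercepts."""
    rng = random.Random(seed)
    rows = []
    for site in range(n_sites):
        x = rng.uniform(-1.0, 1.0)      # hypothetical standardized habitat covariate
        a = rng.gauss(0.0, sd_count)    # site effect, count part
        b = rng.gauss(0.0, sd_zero)     # site effect, zero part (0 when sd_zero == 0)
        pi = inv_logit(gamma[0] + gamma[1] * x + b)  # zero inflation varies with x
        mu = math.exp(beta[0] + beta[1] * x + a)
        for _ in range(n_visits):
            # Each visit is either an always-zero draw or a Poisson draw.
            always_zero = rng.random() < pi
            y = 0 if always_zero else rpois(rng, mu)
            rows.append((site, x, y))
    return rows


rows = simulate_zip_sites()
zero_fraction = sum(1 for _, _, y in rows if y == 0) / len(rows)
print("rows:", len(rows), "fraction of zeros:", round(zero_fraction, 3))
```

Setting `sd_zero` above zero adds the second random effect asked about; simulating with and without it is one way to see how the site effects and the zero-inflation part can compete to explain the same zeros.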


Alain Zuur


Many thanks, and my apologies if anything above is unclear.


Cheers,

JBW



