[R] Zero inflated: is there a limit to the level of inflation

Tue Jun 26 23:46:19 CEST 2012

On Tue, 26 Jun 2012, Marc Schwartz wrote:

> On Jun 26, 2012, at 2:10 PM, SSimek wrote:
>
>> Hello,
>>
>> I have count data that illustrate the presence or absence of individuals in
>> my study population. I created a grid of cells across the study area and
>> calculated a count value for each individual per season per year for each
>> grid cell. The count value is the number of times an individual was present
>> in each grid cell. For illustration, my data columns look something like
>> this and are repeated for each individual:
>>
>> Cell_ID	Param1	Param2	Param3	Param4	COUNT	Name	Year	Season	Cov
>> 1	160.565994	729.08	1503	7930.3	0	AA	2010	AUT	Open
>> 1	160.565994	729.08	1503	7930.3	22	AA	2011	SPR	Open
>> 1	160.565994	729.08	1503	7930.3	12	AA	2009	SUM	Open
>> 1	160.565994	729.08	1503	7930.3	0	AA	2010	SUM	Open
>> 2	169.427001	491.87	1503.31	5101.09	0	AA	2010	AUT	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	16	AA	2011	SPR	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	0	AA	2009	SUM	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	0	AA	2010	SUM	oldHard
>> ...
>> 563	86.777099	612.69	977	4474.6	62	AA	2010	AUT	Water
>> 563	86.777099	612.69	977	4474.6	12	AA	2011	SPR	Water
>> 563	86.777099	612.69	977	4474.6	55	AA	2009	SUM	Water
>>
>>
>> 1	160.565994	729.08	1503	7930.3	0	BB	2010	SUM	Open
>> 2	169.427001	491.87	1503.31	5101.09	72	BB	2010	SUM	oldHard
>> 5	160.75	614.95	1503.31	2878.98	16	BB	2010	SUM	medHard
>> 6	170.404998	510.58	1489.44	743.14	0	BB	2010	SUM	Water
>> ...
>> 563	86.777099	612.69	977	4474.6	0	BB	2010	SUM	Water
>>
>>
>> 1	160.565994	729.08	1503	7930.3	14	C	2005	AUT	Open
>> 1	160.565994	729.08	1503	7930.3	0	C	2006	AUT	Open
>> 1	160.565994	729.08	1503	7930.3	0	C	2006	SPR	Open
>> 1	160.565994	729.08	1503	7930.3	56	C	2007	SPR	Open
>> 1	160.565994	729.08	1503	7930.3	0	C	2006	SUM	Open
>> 2	169.427001	491.87	1503.31	5101.09	124	C	2005	AUT	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	231	C	2006	AUT	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	889	C	2006	SPR	oldHard
>> 2	169.427001	491.87	1503.31	5101.09	0	C	2007	SPR	oldHard
>> ...
>> 563	86.777099	612.69	977	4474.6	0	C	2005	AUT	Water
>> 563	86.777099	612.69	977	4474.6	231	C	2006	AUT	Water
>> 563	86.777099	612.69	977	4474.6	185	C	2006	SPR	Water
>> 563	86.777099	612.69	977	4474.6	123	C	2007	SPR	Water
>> 563	86.777099	612.69	977	4474.6	52	C	2006	SUM	Water
>>
>>
>>
>> I have 563 grid cells across my study area and each individual has 1-563
>> cells associated for each year and each season the individual was monitored.
>> Therefore my grid cells are repeated. I end up with 71,000 records and 925
>> records have a Count value >0; which means 70,075 records have a Count value
>> = 0.
>>
>> I wanted to run a zero-inflated Poisson model to determine mixed effects (of
>> parameters) with individual as the random effect. But I have been advised
>> two things:
>>
>> 1. I cannot run a zero-inflated Poisson model because my data are too
>> "extremely" inflated (i.e. 70,075 vs 925) and
>>
>> 2. I cannot run the model with each cell repeated for each individual. I am
>> told the model doesn't recognize that Cell_ID #1 for individual "A" is the
>> same Cell_ID #1 for individual "B".
>>
>> Does anyone know if either or both of these points are true? I would
>> appreciate any thoughts, advice, or suggestions.
>>
>> Thanks!
>>
>> -Stephanie
>
>
> Hi Stephanie,
>
> Some comments:
>
> 1. You should think about, or at least be open to, a zero-inflated negative binomial distribution rather than a zero-inflated Poisson.
>
> 2. You should at least review the vignette for the pscl CRAN package, which provides standard fixed effects models and related functions for count based data and importantly, some good conceptual content:
>
>  http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
>
> 3. Given the repeated measures framework and correlation issues you likely have, you should subscribe to and re-post your query to the R-sig-mixed-models list:
>
>  https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
> which will avail you of experts in the field.
>
> 4. There is also a draft FAQ for mixed models here:
>
>  http://glmm.wikidot.com/faq
>
> which I believe is maintained by Ben Bolker, who actively participates in the above list. Based upon the content there, I suspect that you will be pointed to the glmmADMB package which is on R-Forge (http://glmmadmb.r-forge.r-project.org/) and can handle zero inflated mixed effects models of at least some types.
>
> 5. If all else fails, just to plant a seed, you might want to consider a 
> mixed effects logistic regression model with a binary response, since 
> you appear to have a relatively small "event" incidence in your data. 
> The above list will also be helpful in that setting and you would likely 
> be pointed to the glmer() function in the lme4 package for that 
> application, which provides for GLMs in a mixed effects framework.

Thanks, Marc, all very useful points! Just one addition:

I would recommend starting with the last point - a binary response 
regression (for y > 0). This could be considered as the zero-hurdle of a 
hurdle regression.
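
A minimal sketch of that first step, on simulated stand-in data rather 
than the real data set. The names dat, x, cov and present are 
hypothetical, and this is a plain fixed-effects glm(), so the random 
effect for individual is deliberately left out:

```r
# Simulated stand-in data; all variable names here are hypothetical
set.seed(1)
n <- 2000
dat <- data.frame(
  x   = rnorm(n),
  cov = factor(sample(c("Open", "oldHard", "Water"), n, replace = TRUE))
)

# Make non-zero counts rare, loosely mimicking a heavily zero-dominated response
p_pos  <- plogis(-3 + 0.8 * dat$x)
is_pos <- rbinom(n, 1, p_pos)
dat$COUNT <- ifelse(is_pos == 1, rpois(n, 5) + 1, 0)

# Binary response: 1 if the count is positive, 0 otherwise
dat$present <- as.integer(dat$COUNT > 0)

# Fixed-effects logistic regression for y > 0 (no random effects)
fit_bin <- glm(present ~ x + cov, data = dat, family = binomial)
summary(fit_bin)
```

For the mixed-effects version, the same binary response could be passed 
to glmer() from lme4, as Marc suggests, with a random intercept for 
individual.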

Hurdle regressions are an alternative to zero-inflated models, but have 
the nice property that the two parts of the hurdle can be estimated 
separately: (1) a binary regression for y = 0 vs. y > 0, and (2) a 
truncated count model for y, estimated only from the observations with 
y > 0. The "pscl" package contains a hurdle() function which estimates 
both parts in one go (and the "countreg" vignette gives more details and 
references), but in this case it would probably be useful to estimate 
them separately.
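
To make part (2) concrete, here is a rough sketch of a zero-truncated 
Poisson regression fitted by direct maximum likelihood with optim() in 
base R. It runs on simulated data, the names x, y, X and negll are 
hypothetical, and it is only an illustration of the idea, not the code 
used inside hurdle():

```r
# Simulate zero-truncated Poisson counts (illustration only)
set.seed(2)
n <- 500
x <- rnorm(n)
lambda <- exp(1 + 0.5 * x)

# Redraw any zeros until all counts are positive (simple rejection sampling)
y <- rpois(n, lambda)
while (any(y == 0)) {
  z <- y == 0
  y[z] <- rpois(sum(z), lambda[z])
}

# Negative log-likelihood of the zero-truncated Poisson with log link:
# log f(y | y > 0) = log dpois(y, lambda) - log(1 - exp(-lambda))
X <- cbind(1, x)
negll <- function(beta) {
  lam <- exp(drop(X %*% beta))
  -sum(dpois(y, lam, log = TRUE) - log1p(-exp(-lam)))
}

# Start from the log of the mean count and a zero slope
fit_tr <- optim(c(log(mean(y)), 0), negll, method = "BFGS")
fit_tr$par
```

With real data, only the rows with y > 0 would enter this fit, while the 
binary part uses all rows.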

In any case, both parts will need care. The binary response probably 
exhibits (quasi-)complete separation for many combinations of covariates 
because non-zeros are so rare. Likewise, the truncated count model may be 
hard to estimate because many combinations of covariates have no positive 
observations at all. But estimating the models separately will give you 
more flexibility in addressing these issues.
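
One simple way to look for such problems with a categorical covariate is 
to cross-tabulate it against the binary response before fitting. The 
sketch below uses simulated data in which, by construction, one 
hypothetical cover type never has a non-zero count; the names dat, cov 
and present are assumptions for illustration:

```r
# Simulated data: events are rare and never occur in "Open" cells
set.seed(3)
n <- 300
dat <- data.frame(
  cov = factor(sample(c("Open", "oldHard", "medHard", "Water"), n, replace = TRUE))
)
p <- ifelse(dat$cov == "Open", 0, 0.1)
dat$present <- rbinom(n, 1, p)

# Cross-tabulate covariate level against presence (0/1)
tab <- table(dat$cov, dat$present)
tab

# Levels without a single non-zero count: a logistic regression cannot
# estimate a finite coefficient for these
no_events <- rownames(tab)[tab[, "1"] == 0]
no_events
```

Levels flagged this way could be merged with others or dropped from the 
binary part, which is easier to do when the two parts are fitted 
separately.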

To estimate the zero-truncated count distributions, you may consider the 
"countreg" package from R-Forge which uses the same code as (one part of) 
the hurdle() function.

hth,
Z

> Regards,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>