[R-sig-eco] hurdle model

Fri Aug 20 18:53:57 CEST 2010

Gavin et al.,

I read the postings in Digest form, so if my approach to responding
here by Reply to the Digest with an edited Subject line screws up
threading I apologize (and suggestions are welcome on a different
approach).

For some time I have been keenly interested in the ZIP and hurdle
models for count data, as in fisheries we often end up in this area as
a modeling strategy (or should anyway). In particular, the ability to
model the zeros (observed absences) as false absences in the sense
described in the referenced paper (site was observed as zero but had
suitable habitat) is a big advantage, as opposed to "forcing" a zero
(species cannot occur there).

However, this strategy still fails to deal with the other possibility
of why a false absence is observed -- detection probability, or the
probability that the survey method simply missed the critter and it
was in fact there. This can be a major bias, particularly in habitat
use analyses for animals. Is anyone aware of a paper comparing this
modeling strategy to occupancy models that deal with detection
probability directly through repeated surveys (as in
http://www.proteus.co.nz/OccWorkshop.html)? This would require a data
set suitable for occupancy models (repeated surveys) that could then
perhaps be compared to using a ZIP or hurdle model on the first survey
data.
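
To make the detection issue concrete, here is a minimal simulation sketch in base R. Everything in it is an assumption for illustration: the occupancy probability psi, the mean abundance lambda, the per-individual detection probability p, and the number of sites are made-up values, not estimates from any survey. It only shows how imperfect detection on a single survey adds false absences on top of the true zeros.

```r
# Hypothetical simulation: all parameter values below are assumptions.
set.seed(1)
n_sites <- 5000
psi     <- 0.6   # probability that a site is occupied
lambda  <- 2     # mean abundance at occupied sites
p       <- 0.3   # per-individual detection probability on one survey

occupied <- rbinom(n_sites, 1, psi)
N        <- occupied * rpois(n_sites, lambda)    # true abundance per site
y1       <- rbinom(n_sites, size = N, prob = p)  # count seen on the first survey

true_zero_frac     <- mean(N == 0)           # sites with no animals present
observed_zero_frac <- mean(y1 == 0)          # sites recorded as zero
false_absence_frac <- mean(y1 == 0 & N > 0)  # animals present but missed

print(c(true = true_zero_frac,
        observed = observed_zero_frac,
        false_absence = false_absence_frac))
```

A ZIP or hurdle model fitted to y1 alone sees only the observed zeros. Separating the missed animals from the true absences is what the repeated surveys in an occupancy design are meant to allow.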

Thanks for your thoughts,
Dave Hewitt
----------
Research Fishery Biologist
USGS Western Fisheries Research Center
Klamath Falls Field Station, Oregon
http://profile.usgs.gov/dhewitt

> Date: Thu, 19 Aug 2010 12:54:01 +0100
> From: Gavin Simpson <gavin.simpson at ucl.ac.uk>
> To: Yingjie Zhang <liv.zhangcn at gmail.com>
> Cc: r-sig-ecology at r-project.org
> Subject: Re: [R-sig-eco] hurdle model
>
> On Thu, 2010-08-19 at 13:20 +0200, Yingjie Zhang wrote:
>> Thanks for the details. The paper is 'Comparing species abundance
>> models' by Joanne M. Potts and Jane Elith. See the link below: in the
>> table on page 158 they compare five models, and both quasi-likelihood
>> and hurdle are mentioned.
>>
>> http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VBS-4KD5C2N-1&_user=794998&_coverDate=11%2F16%2F2006&_rdoc=1&_fmt=high&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=1435498227&_rerunOrigin=google&_acct=C000043466&_version=1&_urlVersion=0&_userid=794998&md5=fc0c4ebc77917948c90f8f0ee3bbe141
>>
>> Maybe we went too far. When I read the paper above, I just thought it
>> would be interesting to try the method they mentioned. My data have
>> both characteristics: excess zeros and overdispersion in the positive
>> part. I am also fairly convinced that the zeros have a single source,
>> which is why I didn't use ZIP/ZINB.
>>
>> For excess zeros, overdispersion, and a single source of zeros, the
>> best model may be a hurdle with a truncated negative binomial, but my
>> aim is to find out which ML method the hurdle model uses.
>
> They fit several models and compare them:
>
>     I. Poisson
>    II. Negative Binomial
>   III. Quasi-likelihood
>    IV. Hurdle model
>     V. zero-inflated model
>
> III should be a quasi-Poisson model, i.e. you fit the Poisson GLM using
> quasi-likelihood and estimate a dispersion parameter \phi alongside the
> usual Poisson GLM parameters.
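
As a small illustration of the quasi-Poisson point, the sketch below uses base R only. The data are simulated and the variable names are made up. It shows that the quasi-Poisson fit keeps the Poisson point estimates, that the dispersion is the Pearson chi-square divided by the residual degrees of freedom, and that no AIC is available because there is no likelihood.

```r
# Simulated overdispersed counts; data and names are assumptions.
set.seed(42)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(0.5 + 0.8 * x), size = 1)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Dispersion estimate: Pearson chi-square / residual degrees of freedom
phi_hat <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

phi_hat
summary(fit_quasi)$dispersion  # the same quantity, as reported by summary()
coef(fit_pois)                 # point estimates ...
coef(fit_quasi)                # ... are the same in both fits
AIC(fit_quasi)                 # NA: quasi families define no likelihood
```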
>
> Section 2.3 of their paper, on the hurdle model, doesn't even mention
> "quasi", though they do mention it in Table 2.
>
> Reading this, I think they cooked up this model themselves - you can fit
> a binomial model yourself for the presence/absence and then fit a count
> model for the samples predicted to be present from the binomial part. To
> make things simple I suspect they fitted the count part as quasi-Poisson
> but nowhere does it say exactly what they did.
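
For what it is worth, one guess at such a home-made two-part model is sketched below in base R. This is not known to be what the paper's authors did. The data are simulated, the names are made up, and the count part is fitted to the observed non-zero samples, which is an assumption on my part rather than something stated in the paper.

```r
# Made-up data: a binary hurdle followed by strictly positive counts.
set.seed(7)
n <- 500
x <- rnorm(n)
present <- rbinom(n, 1, plogis(-0.2 + 1.0 * x))
y <- present * (1 + rnbinom(n, mu = exp(0.3 + 0.5 * x), size = 2))
d <- data.frame(y = y, x = x)

# Part 1: presence/absence as a binomial GLM
fit_zero <- glm(I(y > 0) ~ x, family = binomial, data = d)

# Part 2: non-zero samples only, as an (untruncated) quasi-Poisson GLM
fit_count <- glm(y ~ x, family = quasipoisson, data = d, subset = y > 0)

coef(fit_zero)
coef(fit_count)
summary(fit_count)$dispersion
```

Note that the second part ignores the zero truncation of the positive counts, which is one reason a proper truncated-count likelihood is preferable.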
>
> You would be better off fitting the hurdle as I mentioned using hurdle()
> in pscl; fitting things using quasi-likelihood is just asking for
> trouble if there are proper likelihood options available.
>
> Read the vignette that accompanies the pscl package for details of how
> it fits the various models including the hurdle. This includes the
> likelihood functions that are optimised as part of the fitting.
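
To show what "maximising a proper log-likelihood" means for a hurdle model, here is a hand-written sketch in base R of a Poisson hurdle with a logit zero part. It is not the pscl code, only an illustration of the same model form; the simulated data, the starting values, and the function name negloglik are all assumptions.

```r
# Made-up data from a Poisson hurdle process (all values are assumptions).
set.seed(11)
n      <- 1000
x      <- rnorm(n)
pi_pos <- plogis(0.5 + 1.0 * x)  # P(y > 0)
mu     <- exp(0.7 + 0.4 * x)     # mean of the untruncated Poisson

# Zero-truncated Poisson draws: invert the CDF on the interval (P(Y = 0), 1)
u  <- runif(n, min = dpois(0, mu), max = 1)
yp <- qpois(u, mu)
y  <- ifelse(rbinom(n, 1, pi_pos) == 1, yp, 0)

# Negative log-likelihood: par = (zero intercept, zero slope,
#                                 count intercept, count slope)
negloglik <- function(par) {
  eta_zero <- par[1] + par[2] * x
  mu_hat   <- exp(par[3] + par[4] * x)
  pos      <- y > 0
  # log P(y = 0) for the zeros
  ll_zero <- sum(plogis(eta_zero[!pos], lower.tail = FALSE, log.p = TRUE))
  # log P(y > 0) plus the zero-truncated Poisson log-density for positives
  ll_pos <- sum(plogis(eta_zero[pos], log.p = TRUE) +
                dpois(y[pos], mu_hat[pos], log = TRUE) -
                log1p(-exp(-mu_hat[pos])))
  -(ll_zero + ll_pos)
}

fit <- optim(c(0, 0, 0, 0), negloglik, method = "BFGS")
fit$par     # estimates: zero part, then count part
-fit$value  # maximised log-likelihood
```

Because the likelihood splits into a zero part and a count part, the first two estimates should agree closely with an ordinary binomial GLM of I(y > 0) on x.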
>
> HTH
>
> G
>> On 19 Aug 2010, at 11:49, Gavin Simpson wrote:
>>
>> > On Thu, 2010-08-19 at 11:14 +0200, Yingjie Zhang wrote:
>> >> Hi,
>> >>
>> >> There is a reason why I am drawn to quasi-likelihood: hurdle() from
>> >> 'pscl' uses zero-truncated Poisson regression for the non-zero
>> >> part, which is incapable of handling overdispersion that comes from
>> >> the positive part of the data. Apparently, quasi-likelihood is at
>> >> least a better choice. I've noticed the hurdle they used for the paper
>> >> comes from package 'stats' instead of 'pscl'; I didn't find this
>> >> version of hurdle in R...
>> >
>> > Quasi-likelihood isn't what solves the "over-dispersion comes from the
>> > positive part" problem. It is a means of fitting models, just like
>> > maximum likelihood etc. It will be the authors' model that accounts for
>> > the overdispersion. They estimate the parameters of this model using
>> > quasi-likelihood.
>> >
>> > Your claim about hurdle in stats is incorrect:
>> >
>> >> getAnywhere(hurdle)
>> > no object named 'hurdle' was found
>> >> getAnywhere("hurdle")
>> > no object named 'hurdle' was found
>> >
>> > So they must be using something else. Here's a thought; why not give us
>> > the reference/citation for the paper you are reading --- it is difficult
>> > to speculate further without more details like the actual paper?
>> >
>> > Hurdle models fit a point mass at zero, whilst the count part of the
>> > model is truncated so that it cannot produce any further zeros.
>> >
>> > A zero-inflated model (zeroinfl() in pscl) fits a point mass at zero and
>> > has an untruncated count model, which allows extra zeros to be produced.
>> >
>> > In both cases a negative binomial model may be fitted to the count part,
>> > which may be sufficient to cope with remaining overdispersion in the
>> > count part of your model.
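
For readers who want to try the two model types side by side, the sketch below fits both with a negative binomial count part using pscl. The bioChemists data shipped with pscl is used purely as a stand-in, since no ecological data set has been posted in this thread; the object names are made up.

```r
library(pscl)  # provides hurdle(), zeroinfl() and the bioChemists data

data("bioChemists", package = "pscl")

# Hurdle: binomial zero part + zero-truncated negative binomial counts
fit_hurdle <- hurdle(art ~ ., data = bioChemists, dist = "negbin")

# Zero-inflated: point mass at zero mixed with an untruncated
# negative binomial, so zeros can arise from either component
fit_zinb <- zeroinfl(art ~ ., data = bioChemists, dist = "negbin")

summary(fit_hurdle)
summary(fit_zinb)

# Both are fitted by maximum likelihood, so logLik() and AIC() work
AIC(fit_hurdle, fit_zinb)
```

Which of the two is appropriate depends on whether the zeros are thought to come from one source (hurdle) or two (zero-inflated), not only on which has the lower AIC.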
>> >
>> > I think you would be better off thinking where the overdispersion is
>> > coming from and choosing an appropriate means to model it. You are being
>> > blinded by this talk of quasi-likelihoods. There may well be a way of
>> > fitting the model you want in R without resorting to quasi-likelihood
>> > tricks. But as you haven't told us what model you want to fit or a
>> > citation for the paper you want to replicate, there isn't much further
>> > we can do.
>> >
>> > HTH
>> >
>> > G
>> >
>> >> On 19 Aug 2010, at 10:55, Gavin Simpson wrote:
>> >>
>> >>> On Thu, 2010-08-19 at 10:30 +0200, Yingjie Zhang wrote:
>> >>>> I'd like to try the same way to my dataset, hurdle but estimated by
>> >>>> 'quasi-likelihood', but it's not in the standard 'pscl' package I
>> >>>> think, right?
>> >>>
>> >>> Please keep discussion on list; just because I replied doesn't give you
>> >>> a direct line to my inbox...
>> >>>
>> >>> Why would you want a quasi-likelihood when you could have the real
>> >>> thing? Seriously, if there is no likelihood you can't do likelihood
>> >>> ratio tests, compare models using AIC/BIC etc.
>> >>>
>> >>> Just use hurdle() if it fits the form of model you are after. Don't
>> >>> worry about likelihoods, quasi or otherwise. Check you are happy with
>> >>> the range of models you could fit with hurdle() and use it. If you
>> >>> aren't happy then you'd need to look elsewhere, but don't get hung up on
>> >>> the quasi-likelihood bit.
>> >>>
>> >>> My Tuppence,
>> >>>
>> >>> G
>> >>>
>> >>>> On 19 Aug 2010, at 10:17, Gavin Simpson wrote:
>> >>>>
>> >>>>> On Thu, 2010-08-19 at 09:52 +0200, Yingjie Zhang wrote:
>> >>>>>> Hello everyone,
>> >>>>>>
>> >>>>>> Does any of you use a hurdle model? I am reading a paper which says
>> >>>>>> "Hurdle model removes effect of zero-inflation and over-dispersion in
>> >>>>>> the non-zero observations using a quasi-likelihood". I've checked the
>> >>>>>> help file for hurdle in R, which says something different: "for non-zero
>> >>>>>> obs normally a truncated poisson/NB is used". I just want to make
>> >>>>>> sure: does it really estimate the parameters by "quasi-likelihood"?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Yingjie Zhang
>> >>>>>> Biostatistician
>> >>>>>
>> >>>>> The authors of that paper might have fitted their hurdle model using a
>> >>>>> quasi likelihood but that is not, AFAICT, what is used in the hurdle()
>> >>>>> function in package 'pscl', which maximises a proper log likelihood.
>> >>>>>
>> >>>>> But hard to say from what you have provided.
>> >>>>>
>> >>>>> G