[R-sig-eco] proportion data with many zeros

Mon Feb 4 07:48:34 CET 2013

Liz writes:
> Hi Valerie,
> The best advice I was ever given with regard to distributions was to choose the one with the best fit, i.e. no pattern in the residuals.
> There are 2 things to think about when fitting a GLM. The first is the type of data you've collected (binomial, counts etc.), so that you can get an idea of
> which link will linearise your model correctly and return realistic results (non-negative etc.).
> The second is the mean-variance relationship. This is what will generally show up in the residuals. Gaussian assumes no relationship (constant variance),
> but most proportion/abundance measures will have a variance which varies in some way with the mean. Try plotting your means against your variances and
> have a look at the shape of the distribution of your raw data. Then experiment with some suitable exponential family distributions and see which residuals
> have no pattern.

That's a good point. My variance is obviously not constant: for some plants no pollen was collected at some time points, which gives zero variance,
whereas at other time points the variance is quite large.
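
A minimal sketch of the mean-variance check suggested above, in base R. The counts are made up for illustration (they are not the data from this thread), and the names pollen, time, grp_mean and grp_var are assumptions. The slope of log(variance) on log(mean) gives a rough idea of how the variance scales with the mean.

```r
# Hypothetical counts of one pollen type (out of 300 grains per sample),
# six samples at each of four sampling sessions. Values are made up.
pollen <- c( 0,   2, 15,  0, 40,   3,
            10,  60,  0, 25,  5,  90,
             0,   0,  1,  0,  7,   0,
            30, 120,  8, 55,  0, 200)
time <- factor(rep(paste0("t", 1:4), each = 6))

# Mean and variance of the raw counts within each sampling session
grp_mean <- tapply(pollen, time, mean)
grp_var  <- tapply(pollen, time, var)
print(cbind(mean = grp_mean, variance = grp_var))

# Log-log plot: a slope near 1 suggests variance proportional to the mean
# (quasi-Poisson-like), a slope near 2 suggests variance proportional to
# the squared mean (gamma-like, or negative binomial at large means).
ok <- grp_mean > 0 & grp_var > 0   # logs need strictly positive values
plot(log(grp_mean[ok]), log(grp_var[ok]),
     xlab = "log(mean)", ylab = "log(variance)")
slope <- coef(lm(log(grp_var[ok]) ~ log(grp_mean[ok])))[2]
print(slope)
```

Sessions in which a pollen type was never found have zero mean and zero variance, so they drop out of the log-log plot. With only four sessions the slope is a rough guide, not an estimate to rely on.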

> 
> I think you're correct in not modeling the zeroes as a hurdle - as they are not 'unknowns'.
> Proportion data is very tricky - I've been grappling with percent cover data for a while. Tweedie worked well for me for measures where cover values were
> mid to low, but not well when they were close to 100%.
> If I were you, I'd consider changing the way you use the data to make it simpler. Perhaps just analyse each type of pollen individually over the time periods. I
> assume each time period is the same for the samples and I think n=300 for each of the samples taken?

Yes, it may be better to use the raw counts. I analyse each type of pollen individually anyway. For pollen types that are sampled regularly, the quasipoisson
model works well, but I run into problems with pollen types that are rarely sampled, or not sampled at all at some points in time.
Is there a way to account for differences in mean-variance relationships in quasipoisson or negative binomial models?
When I run my models with quasipoisson, the summary shows no significant effects at all, but when I apply the F test (as suggested in Zuur), I get a
highly significant outcome.
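
One possible explanation for this disagreement, offered as an assumption and not as a diagnosis of the actual data: if a pollen type is absent from every sample at one time point, the coefficient for that level drifts towards minus infinity and its standard error becomes enormous. The Wald t-tests in summary() are then uninformative, while the deviance-based F test still detects the time effect. The counts below are made up, with the all-zero session deliberately used as the reference level.

```r
# Hypothetical counts of a rarely sampled pollen type: absent from every
# sample in the first session, present in the other three. Values are made up.
pollen <- c( 0,  0,  0,  0,  0,  0,
            18, 25, 22, 30, 15, 20,
             3,  6,  4,  8,  5,  2,
            40, 35, 52, 44, 38, 47)
time <- factor(rep(paste0("t", 1:4), each = 6))

fit <- glm(pollen ~ time, family = quasipoisson)

# Wald t-tests from summary(): the all-zero reference level pushes the
# intercept towards -Inf, so every coefficient gets a huge standard error.
wald_p <- coef(summary(fit))[, "Pr(>|t|)"]
print(wald_p)

# F test on the change in deviance when 'time' is added to the null model
f_tab <- anova(fit, test = "F")
f_p <- f_tab[["Pr(>F)"]][2]
print(f_p)
```

In a situation like this sketch, the F test is the more trustworthy of the two, because it compares model fits instead of relying on the curvature of the likelihood at a boundary estimate.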

> So why not just try a quasi-Poisson (or negative binomial) and a Tweedie GLM for each type of pollen separately vs time and see which has better residuals.
> It's much easier to treat these as counts - and no need to do proportions if the n is the same for all.
> Then you can get a significance value for the abundance of each pollen type with each bee at each time period. It is really the same as finding out the relative
> proportions.

> Package tweedie in R works pretty much the same as any GLM. You just need a little bit of code (in the help files) to estimate the index (variance power)
> parameter for each set of values. It should lie between 1 and 2. If not, your data is probably not suited.

I will try it.
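
A minimal sketch of the quasi-Poisson versus negative binomial comparison suggested above, using base R plus MASS. The counts are made up and the object names are assumptions. The two models differ in their assumed mean-variance relationship: quasi-Poisson uses Var(y) = phi * mu, the negative binomial uses Var(y) = mu + mu^2 / theta.

```r
library(MASS)  # glm.nb(); MASS is a recommended package shipped with R

# Hypothetical, strongly overdispersed counts of one pollen type,
# six samples at each of four sampling sessions. Values are made up.
pollen <- c( 0,   2, 15,  0, 40,   3,
            10,  60,  0, 25,  5,  90,
             0,   0,  1,  0,  7,   0,
            30, 120,  8, 55,  0, 200)
time <- factor(rep(paste0("t", 1:4), each = 6))

qp <- glm(pollen ~ time, family = quasipoisson)  # Var(y) = phi * mu
nb <- glm.nb(pollen ~ time)                      # Var(y) = mu + mu^2 / theta

phi <- summary(qp)$dispersion  # estimated quasi-Poisson dispersion
print(phi)
print(nb$theta)                # estimated negative binomial theta

# Pearson residuals against fitted values, side by side. The quasi-Poisson
# residuals are divided by sqrt(phi) so both panels are on a comparable scale.
op <- par(mfrow = c(1, 2))
plot(fitted(qp), residuals(qp, type = "pearson") / sqrt(phi),
     main = "quasi-Poisson", xlab = "fitted", ylab = "scaled Pearson residual")
abline(h = 0, lty = 2)
plot(fitted(nb), residuals(nb, type = "pearson"),
     main = "negative binomial", xlab = "fitted", ylab = "Pearson residual")
abline(h = 0, lty = 2)
par(op)
```

The model whose residual spread stays roughly constant across the fitted values has the more plausible variance function for that pollen type.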

Thank you very much
Best wishes

Valérie

> Let me know if you need any more help.
> Liz
> 

> On 04/02/2013, at 2:10 AM, v_coudrain at voila.fr wrote:
> 
> > Thank you Liz, 
> > I am not familiar with Tweedie, I'll have a look at it, but I do indeed have some high values. I know about the problems linked to the arcsine transformation.
> > I won't consider it anyway. I'd like to use either the raw values of pollen grain counts or a logistic quasibinomial model.
> > Best,
> > Valérie
> > 
> > 
> >> Message of 02/02/13 at 20:47
> >> From: "Liz Pryde"
> >> To: "v_coudrain at voila.fr"
> >> Cc: "Cade Brian", "r-sig-ecology at r-project.org"
> >> Subject: Re: [R-sig-eco] proportion data with many zeros
> >> 
> >> Have you plotted the raw data to have a look at the distribution?
> >> You could try another exponential family distribution like Tweedie that has a mass at zero but is otherwise similar to Poisson/gamma - so you're directly
> >> modeling the zeroes. It won't work if you have a lot of high values though.
> >> Proportions are tricky. Have a read of the Warton and Hui paper (2011), "The arcsine is asinine".
> >> 
> >> Liz
> >> 
> >> 
> >> 
> >> On 02/02/2013, at 6:34 PM, v_coudrain at voila.fr wrote:
> >> 
> >>> Thank you very much for this suggestion. In fact I reconsidered my question and I am not sure that a zero-inflated model is what I need. If I understood it
> >>> properly, a zero-inflated model is best suited when we don't know if zero values are true or false absences (right?). In my case all zero values are assumed
> >>> to be real absences and are therefore informative. However, fitting quasipoisson on raw counts or quasibinomial on proportions gives me awful
> >>> distributions of residuals and meaningless results.
> >>> 
> >>> Valérie
> >>> 
> >>> 
> >>>> Message of 01/02/13 at 17:22
> >>>> From: "Cade, Brian"
> >>>> To: v_coudrain at voila.fr
> >>>> Cc: r-sig-ecology at r-project.org
> >>>> Subject: Re: [R-sig-eco] proportion data with many zeros
> >>>> 
> >>>> For a fully parametric approach, you might want to use a zero-inflated
> >>>> beta distribution (e.g., as available in the gamlss package), which is designed
> >>>> for zero-inflated proportions. Or for a semi-parametric approach, you
> >>>> could estimate a sequence of quantile regressions (e.g., in
> >>>> package quantreg), where some interval (hopefully not too large) of the
> >>>> quantiles will be uninformative because they are massed at the zero values.
> >>>> 
> >>>> Brian
> >>>> 
> >>>> Brian S. Cade, PhD
> >>>> 
> >>>> U. S. Geological Survey
> >>>> Fort Collins Science Center
> >>>> 2150 Centre Ave., Bldg. C
> >>>> Fort Collins, CO 80526-8818
> >>>> 
> >>>> email: brian_cade at usgs.gov
> >>>> tel: 970 226-9326
> >>>> 
> >>>> 
> >>>> 
> >>>> On Fri, Feb 1, 2013 at 1:30 AM, wrote:
> >>>> 
> >>>>> Dear all, I am trying to test how the proportion of pollen of different
> >>>>> plants found in the brood cells of a wild bee changes over time. I
> >>>>> conducted 4 sampling sessions (thus time is a factor with 4 levels) and
> >>>>> collected several pollen samples for each time point (300 pollen grains
> >>>>> counted for each sample). I thought about applying a quasi-binomial glm:
> >>>>> 
> >>>>> y = cbind(total pollen - pollen of plant X, pollen of plant X)
> >>>>> 
> >>>>> glm(y~time, family=quasibinomial)
> >>>>> 
> >>>>> The problem is that I have a lot of zero values, because the pollen of some
> >>>>> plants occurred only rarely or was very clumped in time. I thought about
> >>>>> applying a zero-inflated model, but I have never used one and I am not sure
> >>>>> if it is suitable for proportion data. Additionally, I wondered if I have to
> >>>>> consider the fact that I don't have the same number of pollen samples for
> >>>>> each date, which makes my design unbalanced.
> >>>>> Thank you in advance for advice.
> >>>>> 
> >>>>> Best wishes
> >>>>> Valérie
> >>>>> 
> >>>>> _______________________________________________
> >>>>> R-sig-ecology mailing list
> >>>>> R-sig-ecology at r-project.org
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
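
For reference, the quasi-binomial model from the original question can be written out as a runnable sketch. The data are made up, with unequal numbers of samples per session to mimic the unbalanced design; the names total, plant_x and prop_hat are assumptions. Note that R treats the first column of a two-column response as the successes, so with the column order given in the question the modelled proportion would be that of all pollen other than plant X. The sketch below puts plant X first.

```r
# Hypothetical data: 300 grains counted per sample, unequal numbers of
# samples per session (5, 7, 4 and 6). Values are made up.
total   <- 300
plant_x <- c(12,  0, 30,  5,   0,
              0, 45, 80,  0,  10, 150, 3,
              0,  0,  2,  0,
             60, 90,  0, 20, 110,  35)
time <- factor(rep(paste0("t", 1:4), times = c(5, 7, 4, 6)))

# First column = successes (plant X grains), second column = failures
y <- cbind(plant_x, total - plant_x)

fit <- glm(y ~ time, family = quasibinomial)
print(summary(fit)$dispersion)  # values well above 1 indicate overdispersion
print(anova(fit, test = "F"))   # overall test of the time effect

# Fitted proportion of plant X per session; with a single factor this
# matches the mean raw proportion in each session.
prop_hat <- tapply(fitted(fit), time, mean)
print(prop_hat)
```

With a single factor, glm() copes with unequal numbers of samples per session without any special treatment; the sessions with fewer samples simply get wider standard errors.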