[R-sig-eco] proportion data with many zeros

Liz Pryde elizabethpryde at gmail.com
Sun Feb 3 23:44:40 CET 2013


Hi Valerie,
The best advice I was ever given with regard to distributions was to choose the one with the best fit, i.e. no pattern in the residuals.
There are two things to think about when fitting a GLM. The first is the type of data you've collected (binomial, counts, etc.), which gives you an idea of which link will linearise your model correctly and return realistic results (non-negative, etc.).
The second is the mean-variance relationship. This is what will generally show up in the residuals. Gaussian assumes no relationship (constant variance), but most proportion/abundance measures have a variance that varies in some way with the mean. Try plotting your means against your variances and have a look at the shape of the distribution of your raw data. Then experiment with some suitable exponential family distributions and see which gives residuals with no pattern.
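
A minimal sketch of the mean-variance check in R, on made-up data. The variable names (time, count), the group means and the negative binomial simulation are all assumptions for illustration; substitute your own counts. With only four sampling sessions the fitted slope is a rough hint, not an estimate to rely on.

```r
# Hypothetical data: 4 sampling sessions, 15 samples each, counts of one
# pollen type out of 300 grains, simulated as overdispersed with many zeros.
set.seed(1)
time  <- factor(rep(paste0("t", 1:4), each = 15))
mu    <- c(t1 = 2, t2 = 10, t3 = 40, t4 = 5)[as.character(time)]
count <- pmin(rnbinom(length(time), mu = mu, size = 0.8), 300)

# Mean and variance of the counts within each sampling session
grp_mean <- tapply(count, time, mean)
grp_var  <- tapply(count, time, var)

# On a log-log scale the slope hints at how variance grows with the mean:
# near 0 suggests constant variance (Gaussian), near 1 Poisson-like,
# between 1 and 2 the range where a Tweedie GLM is usually considered.
mv_fit <- lm(log(grp_var) ~ log(grp_mean))
plot(log(grp_mean), log(grp_var),
     xlab = "log(mean)", ylab = "log(variance)")
abline(mv_fit)
print(coef(mv_fit)[2])

# Shape of the raw data (look for the spike at zero and the long right tail)
hist(count, breaks = 30, main = "Raw counts",
     xlab = "Grains of pollen type X per sample")
```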

I think you're correct in not modeling the zeroes as a hurdle - as they are not 'unknowns'. 
Proportion data is very tricky - I've been grappling with percent cover data for a while. Tweedie worked well for me for measures where cover values were mid to low, but not well when they were close to 100%.
If I were you, I'd consider changing the way you use the data to make it simpler. Perhaps just analyse each type of pollen individually over the time periods. I assume each time period is the same for the samples, and I think n=300 for each of the samples taken?

So why not just try a quasi-Poisson (or negative binomial) and a Tweedie GLM for each type of pollen separately vs time and see which has better residuals? It's much easier to treat these as counts, and there is no need to do proportions if the n is the same for all.
Then you can get a significance value for the abundance of each pollen type with each bee at each time period. It is really the same as finding out the relative proportions. 
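
A rough sketch of the count-model comparison for a single pollen type, again on made-up data. The data frame dat and its columns (time, pollen_x) are hypothetical names, and glm.nb() comes from the MASS package that ships with standard R installations. It only illustrates the workflow; it is not a recommendation of either model for your data.

```r
library(MASS)  # provides glm.nb() for negative binomial GLMs

# Hypothetical data: counts of pollen type X out of 300 grains per sample
set.seed(2)
time     <- factor(rep(paste0("t", 1:4), each = 15))
mu       <- c(t1 = 2, t2 = 10, t3 = 40, t4 = 5)[as.character(time)]
pollen_x <- pmin(rnbinom(length(time), mu = mu, size = 0.8), 300)
dat      <- data.frame(time = time, pollen_x = pollen_x)

# Same linear predictor, two different mean-variance assumptions
fit_qp <- glm(pollen_x ~ time, family = quasipoisson, data = dat)
fit_nb <- glm.nb(pollen_x ~ time, data = dat)

# Pearson residuals against fitted values: the better model shows a
# similar spread of residuals across the range of fitted values
op <- par(mfrow = c(1, 2))
plot(fitted(fit_qp), residuals(fit_qp, type = "pearson"),
     main = "Quasi-Poisson", xlab = "Fitted", ylab = "Pearson residual")
abline(h = 0, lty = 2)
plot(fitted(fit_nb), residuals(fit_nb, type = "pearson"),
     main = "Negative binomial", xlab = "Fitted", ylab = "Pearson residual")
abline(h = 0, lty = 2)
par(op)

# Estimated overdispersion under each model, and an overall test of time
print(summary(fit_qp)$dispersion)  # values well above 1 mean overdispersion
print(fit_nb$theta)                # small theta means strong overdispersion
print(anova(fit_qp, test = "F"))
```

Repeating this for each pollen type gives one model, and one test of time, per type.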

Package tweedie in R works pretty much the same as any GLM. You just need a little bit of code (in the help files) to estimate the power parameter (the variance power, written p in the package help) for each set of values. It should lie between 1 and 2. If not, your data is probably not suited.
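
A hedged sketch of that Tweedie workflow on simulated cover-like data. It assumes the tweedie package provides tweedie.profile() returning p.max, and that statmod provides the tweedie() GLM family, as in the versions whose help files I have seen; check the help for your installed versions, since names may differ. The data, the column names (time, cover) and the grid of candidate powers are made up for illustration. The fitting step is skipped if either package is missing.

```r
# Hypothetical data: a compound Poisson-gamma response, i.e. a Poisson number
# of gamma-distributed pieces, giving exact zeros plus a positive continuous part
set.seed(3)
time      <- factor(rep(paste0("t", 1:4), each = 15))
mu_events <- c(t1 = 0.3, t2 = 1, t3 = 3, t4 = 0.6)[as.character(time)]
n_events  <- rpois(length(time), mu_events)
cover     <- vapply(n_events,
                    function(k) sum(rgamma(k, shape = 2, rate = 0.2)),
                    numeric(1))
dat <- data.frame(time = time, cover = cover)

fit_tw <- NULL
if (requireNamespace("tweedie", quietly = TRUE) &&
    requireNamespace("statmod", quietly = TRUE)) {
  # Profile likelihood over candidate powers strictly between 1 and 2
  prof <- tweedie::tweedie.profile(cover ~ time, data = dat,
                                   p.vec = seq(1.1, 1.9, by = 0.1),
                                   do.plot = FALSE, do.ci = FALSE)
  p_hat <- prof$p.max
  print(p_hat)

  # Fit the GLM with the estimated power; link.power = 0 is the log link
  fit_tw <- glm(cover ~ time, data = dat,
                family = statmod::tweedie(var.power = p_hat, link.power = 0))

  # Same residual check as for any other GLM
  plot(fitted(fit_tw), residuals(fit_tw, type = "pearson"),
       xlab = "Fitted", ylab = "Pearson residual")
  abline(h = 0, lty = 2)
}
```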

Let me know if you need any more help.
Liz




On 04/02/2013, at 2:10 AM, v_coudrain at voila.fr wrote:

> Thank you Liz, 
> I don't know tweedie, I'll have a look at it, but I do indeed have some high values. I know about the problems linked to the arcsine transformation. I won't consider it anyway. I'd like to use either the raw values of pollen grain counts or a logistic quasibinomial model.
> Best,
> Valérie
> 
> 
>> Message of 02/02/13 at 20:47
>> From: "Liz Pryde"
>> To: "v_coudrain at voila.fr"
>> Cc: "Cade Brian", "r-sig-ecology at r-project.org"
>> Subject: Re: [R-sig-eco] proportion data with many zeros
>> 
>> Have you plotted the raw data to have a look at the distribution?
>> You could try another exponential family distribution like Tweedie that has a mass at zero but is otherwise similar to Poisson/gamma, so you're directly modeling the zeroes. It won't work if you have a lot of high values though.
>> Proportions are tricky. Have a read of the Warton paper (2012/11?) "the arcsine is asinine".
>> 
>> Liz
>> 
>> 
>> 
>> On 02/02/2013, at 6:34 PM, v_coudrain at voila.fr wrote:
>> 
>>> Thank you very much for this suggestion. In fact I reconsidered my question and I am not sure that a zero-inflated model is what I need. If I understood it properly, a zero-inflated model is best suited when we don't know if zero values are true or false absences (right?). In my case all zero values are assumed to be real absences and are therefore informative. However, fitting quasipoisson on raw counts or quasibinomial on proportions gives me awful distributions of residuals and meaningless results.
>>> 
>>> Valérie
>>> 
>>> 
>>>> Message of 01/02/13 at 17:22
>>>> From: "Cade, Brian"
>>>> To: v_coudrain at voila.fr
>>>> Cc: r-sig-ecology at r-project.org
>>>> Subject: Re: [R-sig-eco] proportion data with many zeros
>>>> 
>>>> For a fully parametric approach, you might want to use a zero-inflated
>>>> beta distribution (e.g., as available in the gamlss package), which is
>>>> designed for zero-inflated proportions. Or for a semi-parametric approach,
>>>> you could estimate a sequence of quantile regression estimates (e.g., in
>>>> package quantreg), where some interval (hopefully not too large) of the
>>>> quantiles will be uninformative because they are massed at the zero values.
>>>> 
>>>> Brian
>>>> 
>>>> Brian S. Cade, PhD
>>>> 
>>>> U. S. Geological Survey
>>>> Fort Collins Science Center
>>>> 2150 Centre Ave., Bldg. C
>>>> Fort Collins, CO 80526-8818
>>>> 
>>>> email: brian_cade at usgs.gov
>>>> tel: 970 226-9326
>>>> 
>>>> 
>>>> 
>>>> On Fri, Feb 1, 2013 at 1:30 AM, wrote:
>>>> 
>>>>> Dear all, I am trying to test how the proportion of pollen of different
>>>>> plants found in the brood cells of a wild bee changes over time. I
>>>>> conducted 4 sampling sessions (thus time is a factor with 4 levels) and
>>>>> collected several pollen samples for each time point (300 pollen grains
>>>>> counted for each sample). I thought about applying a quasi-binomial glm:
>>>>> 
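>>>>> # glm() reads the first column as successes and the second as failures,
>>>>> # so as written this models the share of pollen that is NOT plant X
>>>>> y <- cbind(total_pollen - pollen_x, pollen_x)
>>>>> 
>>>>> glm(y ~ time, family = quasibinomial)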
>>>>> 
>>>>> The problem is that I have a lot of zero values, because the pollen of some
>>>>> plants only occurred rarely or very clumped in time. I thought about
>>>>> applying a zero-inflated model, but I have never used one and I am not sure
>>>>> if it is suitable for proportion data. Additionally I wondered if I have to
>>>>> consider the fact that I don't have the same number of pollen samples for
>>>>> each date, which makes my design unbalanced.
>>>>> Thank you in advance for advice.
>>>>> 
>>>>> Best wishes
>>>>> Valérie
>>>>> _______________________________________________
>>>>> R-sig-ecology mailing list
>>>>> R-sig-ecology at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


