[R-sig-ME] Modelling proportion data in lme4

Thu Apr 6 18:17:01 CEST 2017

You could to try to investigate beta distribution but if you have 0s
or 1s, STAN implementation do not handle 0 or 1. Perhaps, zero
inflated or zero-one inflated beta distribution in STAN could provide
you some help.
"brms" package https://github.com/paul-buerkner/brms have implemented
zero-inflated beta distribution and Paul Buerkner have planned to
implement zero-one inflated beta distribution.

Cheers

Guillaume

2017-04-06 12:28 GMT+02:00 Adriana De Palma <A.De-Palma at nhm.ac.uk>:
> Dear Ramon and Thierry,
>
> Thank you very much for your suggestions. In answer to your questions:
>
>  - I do have 0s in my data.
>  - I don't think we can consider the denominator independent trails in this case. As it is the total abundance of a set of species, each species is 'block voting'.
>
> RE: Ramon's suggestion of a tweedie model for modelling the numerator as the response variable. The proportion data really is the measure that needs to be modelled as this is the compositional similarity calculation, so I'm not sure that a tweedie model will be suitable.
>
> RE Thierry's suggestion: Would it still be suitable to use to total abundance of species as the weights in a binomial model, even if the trials aren't strictly independent?
>
> Thanks both for all your help! Any further advice would be very gratefully received!
>
> Many thanks,
>
> Adriana
>
>
>
>
> -----Original Message-----
> From: Ramon Diaz-Uriarte [mailto:rdiaz02 at gmail.com]
> Sent: 01 April 2017 09:20
> To: Adriana De Palma
> Cc: r-sig-mixed-models at r-project.org
> Subject: Re: [R-sig-ME] Modelling proportion data in lme4
>
> Dear Adriana,
>
>
> On Thu, 30-03-2017, at 09:41, Adriana De Palma <A.De-Palma at nhm.ac.uk> wrote:
>> Dear all,
>>
>> I'd be really grateful if someone could advise on the following issue I've come across.
>>
>> I have proportion data (non-integer, bounded between 0 and 1) as my
>
> Do you actually have some 0s? Most of the rest of my answer assumes you do.
>
>
>> response variable, in a model that requires nested random effects and
>> weights, which makes lme4 the ideal choice. Using lme4 with a binomial
>
> You might want to take a look at:
>
>
> http://stats.stackexchange.com/questions/81343/response-variable-percentage-and-too-many-zeros-zero-inflated-poisson
>
> http://stats.stackexchange.com/questions/142038/two-part-models-in-r-continuous-outcome-with-too-many-zeros
>
> http://stats.stackexchange.com/questions/142013/correct-glmer-distribution-family-and-link-for-a-continuous-zero-inflated-data-s/
>
> and this R-help question (referred from the above questions, e.g. http://stats.stackexchange.com/a/81347):
>
> https://stat.ethz.ch/pipermail/r-help/2005-January/065070.html
>
> where using a Tweedie model is suggested.
>
>
> The cplm CRAN package, by W. Zhang:
> https://cran.r-project.org/web/packages/cplm/index.html
>
> will fit mixed-effects Tweedies.
>
>
> I'd suggesting checking the vignetted of the cplm package, as well as Zhang's paper
>
> http://link.springer.com/10.1007/s11222-012-9343-7
>
>
> and Dunn and Smyth's 2005 paper, which contains examples that use the Tweedie distribution, as well as several references in the literature where these models have been used:
>
> https://link.springer.com/article/10.1007/s11222-005-4070-y
>
>
>
> Take all of this advice with a grain (or two) of salt, but in somewhat similar cases, and when I had a structure of replicates that allowed me to examine the relationship between mean and variance in the response, I have used it to help me decide whether a Tweedie was, or not, a reasonable choice compared to other options; for instance, with the Tweedie model we'd expect to see a linear slope between log(variance) and log(mean), with the slope, p, being the exponent in the relationship V(mu) = mu^p (see, e.g., Figure 3 in the paper by Dunn and Smyth).
>
>
>
>> error structure and logit link seems to produce reasonable (and
>> realistic
>> looking) results, and the residual plots look good. However, it warns
>> me that the error structure expects integer data, and I don't know
>> whether this approach is doing what I think (and hope) that it is
>> doing. I have tried to validate the lme4 results in the following ways:
>>
>>
>> 1.  Running the same method (binomial error structure and logit link
>> with the proportions as the response variable) with glmmADMB. This
>> produces very different results (they are completely unrealistic, e.g.
>> predicted proportion of 2.16e-34).
>>
>> 2.  Using beta regression with glmmADMB. This seems to work and
>> produce results that are on the same scale, but not that close to those of lme4.
>>
>> 3.  Running an lme4 model with normal errors (lmer), after
>> logit-transforming the response variable. This again gives quite
>> different results to the lme4 model with binomial error structure and
>> logit link (and the behaviour of the residuals is not ideal).
>>
>> Since these all give different results, it's hard to tell whether the
>> lme4 method I've used is giving the 'right' answer. I would be really
>> grateful for any advice. Is lme4 correctly analysing the proportion
>> data when a binomial error structure and logit link are specified?
>>
>> Additional note: the proportion data are compositional similarity
>> measurements (Jaccard assymetric abundance-based compositional
>> similarity), so technically there is a numerator and denominator
>> (numerator = abundance of species in Site 1 that are also present in
>> Site 2; denominator = abundance of all species in Site 1). I've been
>> exploring different weights options, but they generally include the denominator.
>
> A couple of comments here:
>
> 1. I am not sure those proportion data can always be modelled as binomial.
> Is the numerator a quantity we can think of as arising from a number of independent trials, where the denominator is that number of independent trials?
>
>
> 2. You might consider modeling the numerator using the denominator not as denominator but as a covariate. This has the advantage of allowing you to examine different possible relationships such as
>
> Numerator ~  Denominator + other stuff
>
> but also
>
> Numerator ~ poly(Denominator, 2) + other stuff
>
> or
>
> Numerator ~ bs(Denominator) + other stuff
>
>
> and just generally things like
>
>
> Numerator ~ some_function_of(Denominator, some_other_covariates)
>
> such as
>
> Numerator ~ poly(Denominator, 2) * some_covariate
>
>
> etc.
>
>
> When you do
>
> Numerator/Denominator ~ other stuff
>
> you are committing yourself to one particular form of that relationship (which might not be easy to reason about).
>
>
>
> Best,
>
>
> R.
>
>
>
>>
>> Many thanks in advance,
>>
>> Adriana
>>
>>
>> _____
>>
>> Adriana De Palma
>> PREDICTS Postdoctoral Research Assistant Natural History Museum South
>> Kensington
>>
>> Web: The Purvis
>> Lab<http://www.bio.ic.ac.uk/research/apurvis/ajpurvis.htm> |
>> PREDICTS<predicts.org.uk>
>>
>>
>>       [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>
> --
> Ramon Diaz-Uriarte
> Department of Biochemistry, Lab B-25
> Facultad de Medicina
> Universidad Autónoma de Madrid
> Arzobispo Morcillo, 4
> 28029 Madrid
> Spain
>
> Phone: +34-91-497-2412
>
> Email: rdiaz02 at gmail.com
>        ramon.diaz at iib.uam.es
>
> http://ligarto.org/rdiaz
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models