[R-sig-ME] Modelling proportion data in lme4
Ramon Diaz-Uriarte
rdiaz02 at gmail.com
Fri Apr 7 10:00:03 CEST 2017
Dear Adriana,
On Thu, 06-04-2017, at 16:17:01, guillaume chaumet <guillaumechaumet at gmail.com> wrote:
> You could to try to investigate beta distribution but if you have 0s
> or 1s, STAN implementation do not handle 0 or 1. Perhaps, zero
> inflated or zero-one inflated beta distribution in STAN could provide
> you some help.
> "brms" package https://github.com/paul-buerkner/brms have implemented
> zero-inflated beta distribution and Paul Buerkner have planned to
> implement zero-one inflated beta distribution.
>
> Cheers
>
> Guillaume
>
> 2017-04-06 12:28 GMT+02:00 Adriana De Palma <A.De-Palma at nhm.ac.uk>:
>> Dear Ramon and Thierry,
>>
>> Thank you very much for your suggestions. In answer to your questions:
>>
>> - I do have 0s in my data.
>> - I don't think we can consider the denominator independent trails in this case. As it is the total abundance of a set of species, each species is 'block voting'.
>>
>> RE: Ramon's suggestion of a tweedie model for modelling the numerator as
>> the response variable. The proportion data really is the measure that
>> needs to be modelled as this is the compositional similarity
>> calculation, so I'm not sure that a tweedie model will be suitable.
I cannot say if it is reasonable in your case, of course, but in one of the
linked emails below, a Tweedie was recommended for percentage data between
0 and 100% (with the caveat that it would be better if there were few or
no 100% )
Best,
R.
>>
>> RE Thierry's suggestion: Would it still be suitable to use to total abundance of species as the weights in a binomial model, even if the trials aren't strictly independent?
>>
>> Thanks both for all your help! Any further advice would be very gratefully received!
>>
>> Many thanks,
>>
>> Adriana
>>
>>
>>
>>
>> -----Original Message-----
>> From: Ramon Diaz-Uriarte [mailto:rdiaz02 at gmail.com]
>> Sent: 01 April 2017 09:20
>> To: Adriana De Palma
>> Cc: r-sig-mixed-models at r-project.org
>> Subject: Re: [R-sig-ME] Modelling proportion data in lme4
>>
>> Dear Adriana,
>>
>>
>> On Thu, 30-03-2017, at 09:41, Adriana De Palma <A.De-Palma at nhm.ac.uk> wrote:
>>> Dear all,
>>>
>>> I'd be really grateful if someone could advise on the following issue I've come across.
>>>
>>> I have proportion data (non-integer, bounded between 0 and 1) as my
>>
>> Do you actually have some 0s? Most of the rest of my answer assumes you do.
>>
>>
>>> response variable, in a model that requires nested random effects and
>>> weights, which makes lme4 the ideal choice. Using lme4 with a binomial
>>
>> You might want to take a look at:
>>
>>
>> http://stats.stackexchange.com/questions/81343/response-variable-percentage-and-too-many-zeros-zero-inflated-poisson
>>
>> http://stats.stackexchange.com/questions/142038/two-part-models-in-r-continuous-outcome-with-too-many-zeros
>>
>> http://stats.stackexchange.com/questions/142013/correct-glmer-distribution-family-and-link-for-a-continuous-zero-inflated-data-s/
>>
>> and this R-help question (referred from the above questions, e.g. http://stats.stackexchange.com/a/81347):
>>
>> https://stat.ethz.ch/pipermail/r-help/2005-January/065070.html
>>
>> where using a Tweedie model is suggested.
>>
>>
>> The cplm CRAN package, by W. Zhang:
>> https://cran.r-project.org/web/packages/cplm/index.html
>>
>> will fit mixed-effects Tweedies.
>>
>>
>> I'd suggesting checking the vignetted of the cplm package, as well as Zhang's paper
>>
>> http://link.springer.com/10.1007/s11222-012-9343-7
>>
>>
>> and Dunn and Smyth's 2005 paper, which contains examples that use the Tweedie distribution, as well as several references in the literature where these models have been used:
>>
>> https://link.springer.com/article/10.1007/s11222-005-4070-y
>>
>>
>>
>> Take all of this advice with a grain (or two) of salt, but in somewhat similar cases, and when I had a structure of replicates that allowed me to examine the relationship between mean and variance in the response, I have used it to help me decide whether a Tweedie was, or not, a reasonable choice compared to other options; for instance, with the Tweedie model we'd expect to see a linear slope between log(variance) and log(mean), with the slope, p, being the exponent in the relationship V(mu) = mu^p (see, e.g., Figure 3 in the paper by Dunn and Smyth).
>>
>>
>>
>>> error structure and logit link seems to produce reasonable (and
>>> realistic
>>> looking) results, and the residual plots look good. However, it warns
>>> me that the error structure expects integer data, and I don't know
>>> whether this approach is doing what I think (and hope) that it is
>>> doing. I have tried to validate the lme4 results in the following ways:
>>>
>>>
>>> 1. Running the same method (binomial error structure and logit link
>>> with the proportions as the response variable) with glmmADMB. This
>>> produces very different results (they are completely unrealistic, e.g.
>>> predicted proportion of 2.16e-34).
>>>
>>> 2. Using beta regression with glmmADMB. This seems to work and
>>> produce results that are on the same scale, but not that close to those of lme4.
>>>
>>> 3. Running an lme4 model with normal errors (lmer), after
>>> logit-transforming the response variable. This again gives quite
>>> different results to the lme4 model with binomial error structure and
>>> logit link (and the behaviour of the residuals is not ideal).
>>>
>>> Since these all give different results, it's hard to tell whether the
>>> lme4 method I've used is giving the 'right' answer. I would be really
>>> grateful for any advice. Is lme4 correctly analysing the proportion
>>> data when a binomial error structure and logit link are specified?
>>>
>>> Additional note: the proportion data are compositional similarity
>>> measurements (Jaccard assymetric abundance-based compositional
>>> similarity), so technically there is a numerator and denominator
>>> (numerator = abundance of species in Site 1 that are also present in
>>> Site 2; denominator = abundance of all species in Site 1). I've been
>>> exploring different weights options, but they generally include the denominator.
>>
>> A couple of comments here:
>>
>> 1. I am not sure those proportion data can always be modelled as binomial.
>> Is the numerator a quantity we can think of as arising from a number of independent trials, where the denominator is that number of independent trials?
>>
>>
>> 2. You might consider modeling the numerator using the denominator not as denominator but as a covariate. This has the advantage of allowing you to examine different possible relationships such as
>>
>> Numerator ~ Denominator + other stuff
>>
>> but also
>>
>> Numerator ~ poly(Denominator, 2) + other stuff
>>
>> or
>>
>> Numerator ~ bs(Denominator) + other stuff
>>
>>
>> and just generally things like
>>
>>
>> Numerator ~ some_function_of(Denominator, some_other_covariates)
>>
>> such as
>>
>> Numerator ~ poly(Denominator, 2) * some_covariate
>>
>>
>> etc.
>>
>>
>> When you do
>>
>> Numerator/Denominator ~ other stuff
>>
>> you are committing yourself to one particular form of that relationship (which might not be easy to reason about).
>>
>>
>>
>> Best,
>>
>>
>> R.
>>
>>
>>
>>>
>>> Many thanks in advance,
>>>
>>> Adriana
>>>
>>>
>>> _____
>>>
>>> Adriana De Palma
>>> PREDICTS Postdoctoral Research Assistant Natural History Museum South
>>> Kensington
>>>
>>> Web: The Purvis
>>> Lab<http://www.bio.ic.ac.uk/research/apurvis/ajpurvis.htm> |
>>> PREDICTS<predicts.org.uk>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>>
>> --
>> Ramon Diaz-Uriarte
>> Department of Biochemistry, Lab B-25
>> Facultad de Medicina
>> Universidad Autónoma de Madrid
>> Arzobispo Morcillo, 4
>> 28029 Madrid
>> Spain
>>
>> Phone: +34-91-497-2412
>>
>> Email: rdiaz02 at gmail.com
>> ramon.diaz at iib.uam.es
>>
>> http://ligarto.org/rdiaz
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain
Phone: +34-91-497-2412
Email: rdiaz02 at gmail.com
ramon.diaz at iib.uam.es
http://ligarto.org/rdiaz
More information about the R-sig-mixed-models
mailing list