[R-sig-ME] Modelling proportion data in lme4

Sat Apr 1 10:19:33 CEST 2017

Dear Adriana,

On Thu, 30-03-2017, at 09:41, Adriana De Palma <A.De-Palma at nhm.ac.uk> wrote:
> Dear all,
>
> I'd be really grateful if someone could advise on the following issue I've come across.
>
> I have proportion data (non-integer, bounded between 0 and 1) as my

Do you actually have some 0s? Most of the rest of my answer assumes you do.

> response variable, in a model that requires nested random effects and
> weights, which makes lme4 the ideal choice. Using lme4 with a binomial

You might want to take a look at:

http://stats.stackexchange.com/questions/81343/response-variable-percentage-and-too-many-zeros-zero-inflated-poisson

http://stats.stackexchange.com/questions/142038/two-part-models-in-r-continuous-outcome-with-too-many-zeros

http://stats.stackexchange.com/questions/142013/correct-glmer-distribution-family-and-link-for-a-continuous-zero-inflated-data-s/

and this R-help question (referred to from the above questions, e.g. http://stats.stackexchange.com/a/81347):

https://stat.ethz.ch/pipermail/r-help/2005-January/065070.html

where using a Tweedie model is suggested.

The cplm CRAN package, by W. Zhang:
https://cran.r-project.org/web/packages/cplm/index.html

will fit mixed-effects Tweedies.

I'd suggest checking the vignette of the cplm package, as well as
Zhang's paper

http://link.springer.com/10.1007/s11222-012-9343-7

and Dunn and Smyth's 2005 paper, which contains examples that use the
Tweedie distribution, as well as several references in the literature where
these models have been used:

https://link.springer.com/article/10.1007/s11222-005-4070-y

Take all of this advice with a grain (or two) of salt. In somewhat similar
cases, when I had replicates that allowed me to examine the relationship
between the mean and the variance of the response, I have used that
relationship to help me decide whether a Tweedie was a reasonable choice
compared to other options. For instance, with the Tweedie model we'd
expect a linear relationship between log(variance) and log(mean), with the
slope, p, being the exponent in the variance function V(mu) = mu^p (see,
e.g., Figure 3 in the paper by Dunn and Smyth).
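
A minimal sketch of that mean-variance check, in base R and on simulated
data (not your data). Everything here is an assumption made for
illustration: the number of groups and replicates, the object names, and the
use of a gamma response, which I picked only because base R can simulate it
and its variance function is known (variance = mu^2 / shape, i.e. p = 2).
With real data you would replace `y` and `group` with your response and your
replicate structure.

```r
set.seed(1)

# Hypothetical design: 50 groups of replicates, each with its own mean
n_groups <- 50
n_rep    <- 200
mu       <- exp(seq(log(1), log(100), length.out = n_groups))
shape    <- 2  # gamma shape; variance = mu^2 / shape, so the true p is 2

# Simulate replicates: a gamma with mean m has scale = m / shape
y     <- unlist(lapply(mu, function(m) rgamma(n_rep, shape = shape, scale = m / shape)))
group <- rep(seq_len(n_groups), each = n_rep)

# Per-group sample mean and variance
grp_mean <- tapply(y, group, mean)
grp_var  <- tapply(y, group, var)

# Under V(mu) = mu^p, log(variance) = log(phi) + p * log(mean),
# so the slope of this regression estimates p
fit   <- lm(log(grp_var) ~ log(grp_mean))
p_hat <- unname(coef(fit)[2])
print(p_hat)

# plot(log(grp_mean), log(grp_var)); abline(fit)  # visual check of linearity
```

A roughly straight line in the log-log plot, with a slope between 1 and 2,
would be compatible with the compound Poisson Tweedie models that cplm fits;
clear curvature would argue against a single power variance function.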

> error structure and logit link seems to produce reasonable (and realistic
> looking) results, and the residual plots look good. However, it warns me
> that the error structure expects integer data, and I don't know whether
> this approach is doing what I think (and hope) that it is doing. I have
> tried to validate the lme4 results in the following ways:
>
>
> 1.  Running the same method (binomial error structure and logit link with
> the proportions as the response variable) with glmmADMB. This produces
> very different results (they are completely unrealistic, e.g. predicted
> proportion of 2.16e-34).
>
> 2.  Using beta regression with glmmADMB. This seems to work and produce
> results that are on the same scale, but not that close to those of lme4.
>
> 3.  Running an lme4 model with normal errors (lmer), after
> logit-transforming the response variable. This again gives quite
> different results to the lme4 model with binomial error structure and
> logit link (and the behaviour of the residuals is not ideal).
>
> Since these all give different results, it's hard to tell whether the
> lme4 method I've used is giving the 'right' answer. I would be really
> grateful for any advice. Is lme4 correctly analysing the proportion data
> when a binomial error structure and logit link are specified?
>
> Additional note: the proportion data are compositional similarity
> measurements (Jaccard asymmetric abundance-based compositional
> similarity), so technically there is a numerator and denominator
> (numerator = abundance of species in Site 1 that are also present in Site
> 2; denominator = abundance of all species in Site 1). I've been exploring
> different weights options, but they generally include the denominator.

A couple of comments here:

1. I am not sure those proportion data can always be modelled as binomial.
Is the numerator a quantity we can think of as arising from a number of
independent trials, where the denominator is that number of independent
trials?

2. You might consider modeling the numerator using the denominator not as
denominator but as a covariate. This has the advantage of allowing you to
examine different possible relationships such as

Numerator ~ Denominator + other stuff

but also

Numerator ~ poly(Denominator, 2) + other stuff

or

Numerator ~ bs(Denominator) + other stuff

and just generally things like

Numerator ~ some_function_of(Denominator, some_other_covariates)

such as

Numerator ~ poly(Denominator, 2) * some_covariate

etc.

When you do

Numerator/Denominator ~ other stuff

you are committing yourself to one particular form of that relationship
(which might not be easy to reason about).
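
To make the contrast concrete, here is a small sketch on simulated counts,
using base R's glm() so that it runs without extra packages. All names
(`num`, `den`, `x`) and the simulated relationship are hypothetical, and I am
leaving out the random effects; in a mixed model the same right-hand sides
would go into a glmer() formula together with the random-effects terms. With
a log link, fixing the numerator to be proportional to the denominator
corresponds to an offset of log(den), whereas entering log(den) as a
covariate lets the data estimate that exponent.

```r
set.seed(42)

# Hypothetical data: the numerator grows with sqrt(denominator), not linearly
n   <- 500
den <- round(runif(n, 5, 200))
x   <- rnorm(n)
num <- rpois(n, lambda = exp(0.5 + 0.5 * log(den) + 0.3 * x))
d   <- data.frame(num = num, den = den, x = x)

# (a) Proportion-like model: coefficient of log(den) is fixed at 1 via an offset
m_offset <- glm(num ~ x + offset(log(den)), family = poisson, data = d)

# (b) Denominator as a covariate: the exponent of den is estimated
m_cov <- glm(num ~ log(den) + x, family = poisson, data = d)

# (c) A more flexible alternative, as in poly(Denominator, 2) + other stuff
m_poly <- glm(num ~ poly(den, 2) + x, family = poisson, data = d)

print(coef(m_cov))
print(AIC(m_offset, m_cov, m_poly))
```

In this simulation the estimated coefficient of log(den) should be close to
0.5 rather than 1, which is exactly the kind of departure that the
Numerator/Denominator formulation cannot reveal.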

Best,

R.

>
> Many thanks in advance,
>
> Adriana
>
>
> _____
>
> Adriana De Palma
> PREDICTS Postdoctoral Research Assistant
> Natural History Museum
> South Kensington
>
> Web: The Purvis Lab<http://www.bio.ic.ac.uk/research/apurvis/ajpurvis.htm> | PREDICTS<predicts.org.uk>
>
>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz