[R-sig-ME] modelling proportions, with aggregated data, and the new/old lme4

Ben Bolker bbolker at gmail.com
Sun Mar 18 19:10:15 CET 2012


Malcolm Fairbrother <m.fairbrother at ...> writes:

> 

[snip]

> I am trying to model Census data, where individual responses to a
> question with two answer options have been aggregated up to the
> community level, and communities are nested in subdistricts. So the
> outcome of interest is a proportion (ranging from 0 to 1), and I
> know the absolute number of "successes" and "failures". For each
> community, the data are available in a slightly disaggregated form
> for different categories of people (I'll use the separate numbers of
> successes and failures for women and men as an example). Thus the
> data look like:

> head(dat)
>   successes failures id sex subdist
> 1       560      726  1   F       4
> 2       844      510  1   M       4
> 3       340      438  2   F       4
> 4       616      273  2   M       4
> 5         7        0  3   F       4
> 6         3        1  3   M       4
> 
> In community #1, which is in subdistrict #4, there are 560 women 

  you mean 510, right?

> who are "successes" and 726 who are "failures" on this social
> indicator, etc. (The data are available via a link below.)
 
> How should I model these data? I was originally thinking to use a
> binomial distribution, and a call defining the outcome as
> "cbind(successes, failures)", with each observation/row nested in
> (a) "id" and (b) "subdist".
 
> (The idea of nesting the separate Census categories like this, and
> using a multilevel approach, comes from:
> http://www.hsph.harvard.edu/faculty/
>     sv-subramanian/files/epa2001_33_3_399_417.pdf.
> There was a recent discussion of the use of "cbind" like this, but
> it seems workable in this case as a way to reduce tends of millions
> of observations to a few tens of thousands, with no loss of
> information:
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2012q1/007552.html.)

 
> However, I gather that the binomial distribution is only appropriate
> where one is sampling a number of Bernoulli trials (i.e., people in
> this case) with replacement, making them independent. As I
> understand it, even if the sampling is not actually done with
> replacement, if N >> n (say at least 10x larger) then the binomial
> approach can be justified, as the trials are independent enough to
> be treated as such. However, in my case, I have no sample at all--I
> have data on the population--so the binomial approach appears to be
> a non-starter. Is that right?

   Hmmm. For better or worse, I've seen the binomial distribution
used in (I think) a reasonable way in cases where strict independence
is not reasonable (e.g. in ecological predation trials where the
prey are all in a single tank with the predator).  Technically you're
assuming both independence *and* homogeneity ... I would probably do
it this way, perhaps testing for overdispersion and/or adding
an individual-level random effect.
 
> I further gather that the less-well-known hypergeometric
> distribution is really most appropriate in cases where sampling is
> done without replacement, though I believe that neither lme4 nor
> MCMCglmm allows for this distribution, and again I have population
> rather than sample data.

  I think this would be a lot harder ... 
 
> I have a vague idea that another possibility would be to use the
> logit of each row (e.g., log(560/726) for the first row), and then
> simply model the logit with Normal errors. But what would I do then
> with rows that have proportions of 0 or 1? And, setting that issue
> aside, is the logit in principle an appropriate way to go?

  I don't think it's ridiculous, but yes, you have to deal with the
0/1 cases.  This fussing-with-zeros-and-ones stuff applies even if you
were to use a Beta regression (which you can do via glmmADMB, although
it'll probably be a lot slower than lme4), which would in some sense
be a more principled way to deal with proportion data.  I like the
idea of keeping the denominators in there and using a binomial model.
I don't think you're missing anything obvious, though.

 
> Can anybody suggest a way forward, and/or explain where my thinking
> above has gone wrong? For a final twist, the code below shows the
> binomial approach, where the old lme4 (now lme4.0) quickly returns
> seemingly sensible results, but the new lme4 (formerly lme4Eigen)
> returns an interesting error.

  For a quick answer, try optimizer="bobyqa", and/or varying tolPwrss
settings, but please be very cautious/compare your answers.  We're still
working on adjusting optimizer choice/settings to make GLMMs in the
new version fast, robust, and accurate ...

  Ben




More information about the R-sig-mixed-models mailing list