[R-sig-ME] modelling proportions, with aggregated data, and the new/old lme4

Sun Mar 18 14:31:04 CET 2012

Dear list,

I have a question for which a more theoretical or more technical answer may be helpful--I'm not sure. Any assistance would be appreciated.

I am trying to model Census data, where individual responses to a question with two answer options have been aggregated up to the community level, and communities are nested in subdistricts. So the outcome of interest is a proportion (randing from 0 to 1), and I know the absolute number of "successes" and "failures". For each community, the data are available in a slightly disaggregated form for different categories of people (I'll use the separate numbers of successes and failures for women and men as an example). Thus the data look like:

head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women who are "successes" and 726 who are "failures" on this social indicator, etc. (The data are available via a link below.)

How should I model these data? I was originally thinking to use a binomial distribution, and a call defining the outcome as "cbind(successes, failures)", with each observation/row nested in (a) "id" and (b) "subdist".

(The idea of nesting the separate Census categories like this, and using a multilevel approach, comes from: http://www.hsph.harvard.edu/faculty/sv-subramanian/files/epa2001_33_3_399_417.pdf. There was a recent discussion of the use of "cbind" like this, but it seems workable in this case as a way to reduce tends of millions of observations to a few tens of thousands, with no loss of information: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2012q1/007552.html.)

However, I gather that the binomial distribution is only appropriate where one is sampling a number of Bernoulli trials (i.e., people in this case) with replacement, making them independent. As I understand it, even if the sampling is not actually done with replacement, if N >> n (say at least 10x larger) then the binomial approach can be justified, as the trials are independent enough to be treated as such. However, in my case, I have no sample at all--I have data on the population--so the binomial approach appears to be a non-starter. Is that right?

I further gather that the less-well-known hypergeometric distribution is really most appropriate in cases where sampling is done without replacement, though I believe that neither lme4 nor MCMCglmm allows for this distribution, and again I have population rather than sample data.

I have a vague idea that another possibility would be to use the logit of each row (e.g., log(560/726) for the first row), and then simply model the logit with Normal errors. But what would I do then with rows that have proportions of 0 or 1? And, setting that issue aside, is the logit in principle an appropriate way to go?

Can anybody suggest a way forward, and/or explain where my thinking above has gone wrong? For a final twist, the code below shows the binomial approach, where the old lme4 (now lme4.0) quickly returns seemingly sensible results, but the new lme4 (formerly lme4Eigen) returns an interesting error.

Many thanks,
Malcolm

load(url("http://dl.dropbox.com/u/46385700/dat.RData"))
closeAllConnections()

M1 <- glm(cbind(successes, failures) ~ sex, dat, family=binomial)
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(coef(M1))))

library(lme4.0) # lme4.0_0.9999-1

system.time(M2 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(fixef(M2))))

system.time(M3 <- lmer(cbind(successes, failures) ~ sex + (1 | id) + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(fixef(M3))))

detach(package:lme4.0)
library(lme4) # lme4_0.99990234375-0

# however, with the new lme4 (former lme4Eigen) I get an error:

system.time(M4 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
Error in fn(nM$xeval()) : 
  step factor reduced below 0.001 without reducing pwrss
Timing stopped at: 16.159 0.639 16.787