[R-sig-ME] prediction from glmer(..., family = binomial(link = "probit"))

Fri Jan 9 15:58:54 CET 2015

I’m using glmer to fit a binomial GLMM with a probit link to some simulated data, and I’m looking for help in interpreting (back-transforming) the intercept estimate.

The response is logical (is a subject resistant to an infectious disease or not). Subjects are clustered in 700 families, with 3 subjects in each family:

        fam resistant
  1       1     FALSE
  2       1     FALSE
  3       1     FALSE
  4       2     FALSE
  5       2     FALSE
  6       2     FALSE
  7       3     FALSE
  8       3     FALSE
  9       3     FALSE
  …
  2100   700    FALSE

The prevalence of resistance is low, at 5%, and the degree of familial clustering is moderate, with ICC = 0.25 (this is the ICC of the latent trait liability). The prevalence in the simulated data is 5.3%, close to the parameter value. 

I fitted the following model: 

  fit.glmer <- glmer(resistant ~ 1 + (1 | fam), family = binomial(link = "probit"), data = simdat)

The intercept estimate is -3.79. Back-transforming this gives a very low prevalence estimate of about 0.01% when I marginalise over the random effect:
> pnorm(fixef(fit.glmer)) 
 (Intercept) 
7.550163e-05 
or equivalently:
>   predict(fit.glmer, type = "response", re.form = NA)[1]
           1 
7.550163e-05 

Averaging over the family-specific predicted prevalences gets closer, but still underestimates the true prevalence (this bias is repeatable, so not just dues to sampling families):
> mean(predict(fit.glmer, type = "response", re.form = NULL))
[1] 0.0435032
This does better because back-transforming before taking the mean draws in the predicted prevalence of the families with no resistant subjects, which is very low on the probit scale but bounded by zero on the proportion scale. However I’m pretty sure this is the wrong way to do it.

Perhaps a more likely explanation is that I’m wrong to assume variance = 1 in the inverse probit function? Should I be adding the inter-family variance, additive dispersion (is this fixed at 1?) and the distribution specific variance, as predict.MCMCglmm does (I’ve fitted the same model in MCMCglmm and it recovers the prevalence and ICC accurately)? However the family variance estimate is so large that including it results in large overestimates, and the fact that predict.merMod doesn’t do this suggests it isn’t necessary.

Alternatively, could glmer be overestimating the inter-family variance, leading to an underestimate of the prevalence, due to a combination of low prevalence and and small families (n=3; most families have 0/3 resistant, and the maximum is 2/3). 

I’m exploring these possibilities but it would be great to have some expert advice.

Thanks in advance,
Paul

PS I could post the simulation code but it’s rather long so I’ll wait and see if there’s a simple solution.