[R-sig-ME] Fwd: lme4, lme4a, and overdispersed distributions (again)

Tue Jun 29 14:40:15 CEST 2010

Thanks John,
You answered my question before it got past my lips.

Following on this, if I run two non-nested models with overdispersion accouted for using a random observation effect, what would be the proper way to rank them? Would you just use AICc in the standard way? Or if you were to use QAICc, what do you use as the scale parameter? I'm not exactly sure how to interpret the result.

Cheers,
Jeff

--- John Maindonald wrote:
Verson lme4_0.999375-34   that David is using seems not to be on CRAN yet.
But, as Steven indicated, one can get it from R-forge using the command.

install.packages("lme4", repos="http://R-Forge.R-project.org")

This is available for MacOS X as well as (I presume) for Windows.  The immediately
previous (-33) release fails compilation for MacOS X.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On 29/06/2010, at 7:04 AM, David Atkins wrote:

> 
> Jeffrey Evans wrote:
>> Hi Dave, I see that Ben Bolker et al's wiki page also references it.
>> I'm not able to get glmer to do this. I do something similar with my dataset
>> (dat$over = 1:nrow(dat)) and I get an error, not a warning.
>>> Xfit$over = 1:nrow(Xfit)
>>> m800 = glmer(cbind(gmdat$SdlFinal, gmdat$SdlMax-gmdat$SdlFinal)
>> ~soilpc3+(1|gmdat$ID)+(1|over),data=Xfit, family="binomial")
>> Error in function (fr, FL, glmFit, start, nAGQ, verbose)  :   Number of levels of a grouping factor for the random effects
>> must be less than the number of observations
>> What version are you using? I'm running version 33.
> 
> Sorry, I should have included sessionInfo() in my earlier email:
> 
> Looks like I'm running v 34 with Matrix v 41 (on R 2.11.1)
> 
> [And I did just re-confirm that it works with a different dataset...]
> 
> cheers, dave
> 
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
> 
> other attached packages:
> [1] lme4_0.999375-34   Matrix_0.999375-41 lattice_0.19-7
> [4] car_2.0-0          survival_2.35-9    leaps_2.9
> [7] nnet_7.3-1         MASS_7.3-6
> 
> loaded via a namespace (and not attached):
> [1] grid_2.11.1    MCMCglmm_2.05  memisc_0.95-30 nlme_3.1-96
> [5] stats4_2.11.1
> 
>> Cheers,
>> Jeff
>> John et al.--
>> Actually, looks like the current version of glmer *does* allow observation level random-effects, though it throws you a little warning (which seems entirely appropriate). (Thank you Doug!)
>> Using the data that we were recently discussing (and attached to one of my previous posts "Data Redux"):
>>> drink.df$over <- 1:nrow(drink.df)
>>> drk.glmer <- glmer(drinks ~ weekday*gender + (1 | id) + (1 | over),
>> + 					data = drink.df, family = poisson,
>> + 					verbose = TRUE)
>> Number of levels of a grouping factor for the random effects
>> is *equal* to n, the number of observations
>>  0:     96443.694:  1.63299 0.215642 -0.912787 -0.0135537 0.0244628 -0.0137854 0.504067  1.13637  1.06483 0.418800 0.366592 0.288067 0.354827 0.443268 0.392234 0.287877
>> [snip]
>> 76:     73629.541:  4.35359 0.187209 -5.44079 -0.101647 -0.0249078 -0.0721097 0.513271  1.65032  1.51045 0.132999 0.435110 0.356431 0.382994 0.600458  1.26700 0.797529
>>> summary(drk.glmer)
>> Generalized linear mixed model fit by the Laplace approximation
>> Formula: drinks ~ weekday * gender + (1 | id) + (1 | over)
>>   Data: drink.df
>>   AIC   BIC logLik deviance
>> 73662 73805 -36815    73630
>> Random effects:
>> Groups Name        Variance  Std.Dev.
>> over   (Intercept) 18.953746 4.35359
>> id     (Intercept)  0.035047 0.18721
>> Number of obs: 56199, groups: over, 56199; id, 980
>> Fixed effects:
>>                         Estimate Std. Error z value Pr(>|z|)
>> (Intercept)              -5.44079    0.15315  -35.53  < 2e-16 ***
>> weekdayMonday            -0.10165    0.22164   -0.46  0.64651
>> weekdayTuesday           -0.02491    0.21837   -0.11  0.90919
>> weekdayWednesday         -0.07211    0.22089   -0.33  0.74408
>> weekdayThursday           0.51327    0.19953    2.57  0.01010 *
>> weekdayFriday             1.65032    0.17918    9.21  < 2e-16 ***
>> weekdaySaturday           1.51045    0.18023    8.38  < 2e-16 ***
>> genderM                   0.13300    0.22493    0.59  0.55432
>> weekdayMonday:genderM     0.43511    0.31295    1.39  0.16442
>> weekdayTuesday:genderM    0.35643    0.31078    1.15  0.25142
>> weekdayWednesday:genderM  0.38299    0.31327    1.22  0.22150
>> weekdayThursday:genderM   0.60046    0.28439    2.11  0.03474 *
>> weekdayFriday:genderM     1.26700    0.25845    4.90 9.48e-07 ***
>> weekdaySaturday:genderM   0.79712    0.26107    3.05  0.00226 **
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>> As an aside, it is interesting to see for this particular data how the variance swings pretty wildly between the model without over-dispersion to the current one:
>>> summary(drk.glmer)
>> Generalized linear mixed model fit by the Laplace approximation
>> Formula: drinks ~ weekday * gender + (1 | id)
>>   Data: drink.df
>>    AIC    BIC logLik deviance
>> 146572 146706 -73271   146542
>> Random effects:
>> Groups Name        Variance Std.Dev.
>> id     (Intercept) 0.92314  0.9608
>> Number of obs: 56199, groups: id, 980
>> Fixed effects:
>>                         Estimate Std. Error z value Pr(>|z|)
>> (Intercept)              -1.23639    0.04863  -25.42  < 2e-16 ***
>> weekdayMonday            -0.01035    0.03331   -0.31    0.756
>> weekdayTuesday            0.02583    0.03304    0.78    0.434
>> weekdayWednesday         -0.01161    0.03341   -0.35    0.728
>> weekdayThursday           0.50049    0.02976   16.82  < 2e-16 ***
>> weekdayFriday             1.12677    0.02691   41.87  < 2e-16 ***
>> weekdaySaturday           1.05954    0.02709   39.11  < 2e-16 ***
>> genderM                   0.30064    0.07099    4.24 2.28e-05 ***
>> weekdayMonday:genderM     0.36970    0.04359    8.48  < 2e-16 ***
>> weekdayTuesday:genderM    0.29552    0.04358    6.78 1.19e-11 ***
>> weekdayWednesday:genderM  0.36295    0.04378    8.29  < 2e-16 ***
>> weekdayThursday:genderM   0.45315    0.03913   11.58  < 2e-16 ***
>> weekdayFriday:genderM     0.40290    0.03587   11.23  < 2e-16 ***
>> weekdaySaturday:genderM   0.29764    0.03624    8.21  < 2e-16 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>> These data in all likelihood are probably best modeled by a two-part model, with zero vs. non-zero and count model for non-zeroes.  The current version of MCMCglmm allows for zero-inflated models (where zeroes are a mixture of a point mass and count distribution), and the development version has hurdle formulations (with zero vs. non-zero, and then truncated count distribution for non-zeroes).
>> cheers, Dave
>> John Maindonald wrote:
>> I think it more accurate to say that, in general, there may be
>> a class of distributions, and therefore a possible multiplicity
>> of likelihoods, not necessarily for distributions of exponential
>> form.  This is a PhD thesis asking to be done, or maybe
>> someone has already done it.
>> Over-dispersed distributions, where it is entirely clear what the
>> distribution is, can be generated as GLM model +  one random
>> effect per observation.  We have discussed this before.  This
>> seems to me the preferred way to go, if such a model seems to
>> fit the data.  I've not checked the current state of play re fitting
>> such models in lme4 of lme4a; in the past some versions have
>> allowed such a model.
>> I like the simplicity of the one random effect per observation
>> approach, as against what can seem the convoluted theoretical
>> framework in which beta binomials live.
>> John Maindonald             email: john.maindonald at anu.edu.au
>> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>> Centre for Mathematics & Its Applications, Room 1194,
>> John Dedman Mathematical Sciences Building (Building 27)
>> Australian National University, Canberra ACT 0200.
>> http://www.maths.anu.edu.au/~johnm
>>> 
>>> On 25/06/2010, at 3:59 AM, Jeffrey Evans wrote:
>>> 
>>>> Since I am definitely *not* a mathematician, I am straying in over my head
>>>> here.
>>>> 
>>>> I understand what you are saying - that there isn't a likelihood function
>>>> for the quasi-binomial "distribution". And therefore, there is no-such
>>>> distribution.
>>>> 
>>>> What do you think of the suggestion that a beta-binomial mixture
>>>> distribution could be used to model overdispersed binomial data?
>>>> 
>>>> Would this be a techinically correct and logistically feasibile solution?
>>>> 
>>>> -jeff
>>>> 
>>>> -----Original Message-----
>>>> From: dmbates at gmail.com [mailto:dmbates at gmail.com] On Behalf Of Douglas
>>>> Bates
>>>> Sent: Thursday, June 24, 2010 1:25 PM
>>>> To: Jeffrey Evans
>>>> Cc: r-sig-mixed-models at r-project.org
>>>> Subject: Re: [R-sig-ME] lme4, lme4a, and overdispersed distributions (again)
>>>> 
>>>> On Thu, Jun 24, 2010 at 11:54 AM, Jeffrey Evans
>>>> <Jeffrey.Evans at dartmouth.edu> wrote:
>>>>> Like others, I have experienced trouble with estimation of the scale
>>>>> parameter using the quasi-distributions in lme4, which is necessary to
>>>>> calculate QAICc and rank overdispersed generalized linear mixed models.
>>>> 
>>>>> I had several exchanges with Ben Bolker about this early last year
>>>>> after his TREE paper came out
>>>>> (http://www.cell.com/trends/ecology-evolution/abstract/S0169-5347%2809
>>>>> %29000 19-6), and I know it's been discussed on on this list. Has
>>>>> there been or is there any potential resolution to this forthcoming in
>>>>> future releases of
>>>>> lme4 or lme4a? I run into overdispersed binomial distributions
>>>>> frequently and have had to use SAS to deal with them. SAS appears to
>>>>> work, but it won't estimate the overdispersion parameter using laplace
>>>>> estimation (only PQL), As I understand it, these pseudo-Iikelihoods
>>>>> can't be used for model ranking. I don't know why SAS can't/won't, but
>>>>> lme4 will run these quasi-binomial and quasi-poisson distributions with
>>>> Laplace estimation.
>>>> 
>>>>> Is there a workable way to use lme4 for modeling overdispersed
>>>>> binomial data?
>>>> 
>>>> I have trouble discussing this because I come from a background as a
>>>> mathematician and am used to tracing derivations back to the original
>>>> definitions.  So when I think of a likelihood (or, equivalently, a
>>>> deviance) to be optimized it only makes sense to me if there is a
>>>> probability distribution associated with the model.  And for the
>>>> quasi-binomial and quasi-Poisson families, there isn't a probability
>>>> distribution.  To me that means that discussing maximum likelihood
>>>> estimators for such models is nonsense.  The models simply do not exist.
>>>> One can play tricks in the case of a generalized linear model to estimate a
>>>> "quasi-parameter" that isn't part of the probability distribution but it is
>>>> foolhardy to expect that the tricks will automatically carry over to a
>>>> generalized linear mixed model.
>>>> 
>>>> I am not denying that data that are over-dispersed with respect to the
>>>> binomial or Poisson distributions can and do occur.  But having data like
>>>> this and a desire to model it doesn't make the quasi families real.   In his
>>>> signature Thierry Onkelinx quotes
>>>> 
>>>> The combination of some data and an aching desire for an answer does not
>>>> ensure that a reasonable answer can be extracted from a given body of data.
>>>> ~ John Tukey
>>>> 
>>>> I could and do plan to incorporate the negative binomial family but, without
>>>> a definition that I can understand of a quasi-binomial or quasi-Poisson
>>>> distribution and its associated probability function, I'm stuck. To me it's
>>>> a "build bricks without straw" situation - you can't find maximum likelihood
>>>> estimates for parameters that aren't part of the likelihood.
>>>> 
>>>> _______________________________________________
>>>> R-sig-mixed-models at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>> 
>> John Maindonald             email: john.maindonald at anu.edu.au
>> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>> Centre for Mathematics & Its Applications, Room 1194,
>> John Dedman Mathematical Sciences Building (Building 27)
>> Australian National University, Canberra ACT 0200.
>> http://www.maths.anu.edu.au/~johnm
>> 
> 
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--- end of quote ---