[R-sig-ME] Multilevel logistic regression
Douglas Bates
bates at stat.wisc.edu
Sun Mar 22 18:41:27 CET 2009
On Sun, Mar 22, 2009 at 9:22 AM, Brant Inman <brant.inman at me.com> wrote:
> Ben,
> Thanks for confirming my findings. Two questions for you:
> 1) Is it normal that lmer does NOT find the following 3 formulae to be equivalent?
> f1 <- lmer( cbind(success, fail) ~ covariates + (1 | group), data=wide ...
> f2 <- lmer( (success/(success+fail)) ~ covariates + (1 | group), weights=(success+fail), data=wide ...
> f3 <- lmer( success ~ covariates + (1 | group), data=long ...
The code in lmer that counts the "number of observations" is too
simplistic. It uses the length of the vector of responses, which is
appropriate except for models like f1. What should be done in that
case is to evaluate the sum of the number of cases represented by each
row of the data.
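As a concrete illustration of the two counts involved (a minimal sketch
with made-up numbers, not the lme4 internals):

resp <- cbind(success = c(40, 55, 12), fail = c(10, 5, 8))  # hypothetical data
nrow(resp)           # 3   -- counting one response per row of the data
sum(rowSums(resp))   # 130 -- the number of binary trials those rows represent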
Tracking such things down is not terribly easy. The "initialize"
expression from a glm family is "not unlike" a gross hack. It is an
expression (not a function) that is evaluated for the sole purpose of
scattering a bunch of objects around the function evaluation
environment, which often ends up looking like a teenager's bedroom
afterwards.
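You can see the mechanism for yourself (the eval() call is sketched in a
comment because it only makes sense inside a fitting function where y, nobs
and friends are already defined):

fam <- binomial()
class(fam$initialize)   # "expression" -- an unevaluated expression, not a function
## glm.fit() and similar code then do, in effect,
##   eval(fam$initialize)
## which creates n, mustart, etc. directly in the fitting function's frame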
Yes, special purpose code could be written to detect a situation like
model f1 and account for it appropriately. Doing so will mean keeping
track of two, possibly different, definitions of the "number of
observations". If someone is willing to produce a patch for the code
I will consider it.
I don't think it will be as easy to detect that model f2 represents
more observations than are recorded in the data. All you know is that
you are given observation weights, which could be the result of
effects other than multiple "observations" per observation.
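The distinction is easier to see with plain glm(), where the two forms give
an identical fit but the second one only exposes a vector of prior weights
(made-up data again):

dat <- data.frame(success = c(40, 55, 12), fail = c(10, 5, 8), x = c(0, 1, 2))
g1 <- glm(cbind(success, fail) ~ x, family = binomial, data = dat)
g2 <- glm(success/(success + fail) ~ x, weights = success + fail,
          family = binomial, data = dat)
all.equal(coef(g1), coef(g2))   # TRUE, but g2 was told only about weights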
> 2) How did you hack lmer to avoid this problem and did it compute the result faster with the wide dataset than the long one?
Well, you start by grepping the source code to determine where the
error message is produced, then you disable that part of the code.
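Something along these lines, pointed at an unpacked copy of the lme4
sources, will turn up candidate locations (the directory name and the
search pattern here are guesses, not a description of the actual file
layout):

src  <- list.files("lme4", pattern = "\\.(R|c)$", recursive = TRUE, full.names = TRUE)
hits <- lapply(src, function(f) grep("q = .* > n", readLines(f, warn = FALSE), value = TRUE))
Filter(length, setNames(hits, src))   # files mentioning the check, to inspect and patch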
The error message is there for a reason - primarily related to linear
mixed models. Sometimes people get carried away and define random
effects with respect to a factor that has a distinct level for each
observation. You can estimate the parameters in such a model but not
uniquely. The variance due to such a "one level per observation"
factor is completely confounded with the variance of the
"per-observation" noise term. People have reported results from such
models fit by lme so I wanted to protect against that by flagging such
models in lme4. One could adopt a "caveat emptor" approach and say
that the user is responsible for ensuring that the model they fit is
mathematically reasonable, so if they get nonsense results for a
nonsense model it's their fault. Unfortunately, that ends up
reflecting badly on lme4 or the R project. People fail to distinguish
between a package in R and R in general so this type of problem is
reported as a flaw in R, often by people who want to convince
companies to pony up large amounts of money for commercial software
licenses.
Hence, I have opted for the more conservative approach of throwing an
error in such cases. As Ben has pointed out, lme4 is open source so
if you want to modify it to fit a particular type of model that
currently produces an error, you have the opportunity of doing so.
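For what it's worth, here is a toy example of the kind of model the check
is guarding against; the exact wording of the error depends on the lme4
version, so the offending call is left commented out:

library(lme4)
set.seed(1)
toy <- data.frame(y = rnorm(30), x = runif(30), obs = factor(seq_len(30)))
## one level of 'obs' per row: its variance is confounded with the residual
## variance, and lmer() refuses to fit the model
## lmer(y ~ x + (1 | obs), data = toy)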
> ----------------------------------------------------------------------------------------------------------------------------
>
> On Saturday, March 21, 2009, at 11:52PM, "Ben Bolker" <bolker at ufl.edu> wrote:
>> I tried this out with a hacked version of lme4 that removes
>>the test on the number of observations, and get reasonable answers
>>for f3, to wit:
>>
>>Random effects:
>> Groups Name Variance Std.Dev. Corr
>> SOURCE (Intercept) 0.78845 0.88795
>> TELDUM 0.24216 0.49210 -0.281
>> MAILDUM 0.52215 0.72260 -0.365 0.317
>>Number of obs: 105, groups: SOURCE, 48
>>
>>Fixed effects:
>> Estimate Std. Error z value Pr(>|z|)
>>(Intercept) 1.12762 0.19519 5.777 7.6e-09 ***
>>RESPISRR 0.20714 0.21497 0.964 0.335258
>>TELDUM -0.20251 0.09141 -2.215 0.026737 *
>>MAILDUM -0.56094 0.14708 -3.814 0.000137 ***
>>---
>>
>> The random effects (esp for telephone/mail) are a little different.
>>It doesn't look like the software used in the chapter estimates
>>correlations among random effects?
>>
>>## Table 6.4: Models for response rates in different conditions
>>##                  conditions fixed   conditions random
>>##                  coeff. (s.e.)      coeff. (s.e.)
>>## Fixed part
>>## Predictor
>>##   intercept       0.90 (.14)         1.17 (.21)
>>##   resptype        0.53 (.06)         0.20 (.23)
>>##   telephone      -0.16 (.02)        -0.20 (.10)
>>##   mail           -0.49 (.03)        -0.58 (.16)
>>## Random part
>>##   intercept1      1.00               1.00
>>##   intercept2      0.86 (.18)         0.87 (.20)
>>##   telephone                          0.26 (.08)
>>##   mail                               0.59 (.20)
>>
>>Eliminating the correlation among random effects:
>>
>>f3B <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1|SOURCE) +
>>(0+TELDUM|SOURCE) + (0+MAILDUM|SOURCE),
>> family=binomial, data=wide)
>>summary(f3B)
>>
>>Random effects:
>> Groups Name Variance Std.Dev.
>> SOURCE (Intercept) 0.72203 0.84972
>> SOURCE TELDUM 0.23582 0.48562
>> SOURCE MAILDUM 0.43923 0.66274
>>Number of obs: 105, groups: SOURCE, 48
>>
>>
>> telephone/mail random effects still estimated higher.
>>Tried nAGQ=6 (crank up Gauss-Hermite quadrature):
>>
>>Random effects:
>> Groups Name Variance Std.Dev.
>> SOURCE (Intercept) 0.74275 0.86183
>> SOURCE TELDUM 0.24565 0.49563
>> SOURCE MAILDUM 0.28567 0.53448
>>Number of obs: 105, groups: SOURCE, 48
>>
>>Brings mail RE down (*below* PQL est.) but telephone RE is still high.
>>
>>
>>
>>
>>Brant Inman wrote:
>>> Thierry et al:
>>>
>>> I think I solved my problem and it provides some insight into the way
>>> lmer handles binomial data, so I will share the findings here. First,
>>> I have made two datasets available on the web, a long format and a
>>> wide format version of the "metaresp" data of Joop Hox. They can be
>>> found here:
>>>
>>> http://www.duke.edu/~bi6/
>>>
>>> Hox has a draft of the chapter of interest that discusses the
>>> metaresp dataset and the modeling process/problem that I am trying to
>>> solve. Note that the results that I am trying to reproduce are in
>>> Tables 6.3 and 6.4. The chapter can be found at:
>>>
>>> http://www.geocities.com/joophox/papers/chap6.pdf
>>>
>>> Now here are the models that I fit with lmer. Assume that the wide
>>> version of the data is called "wide" and the long version "long".
>>> -----------------------------
>>>
>>> y <- cbind(wide$SUCCESS, wide$FAIL)
>>>
>>> f1 <- lmer(y ~ RESPISRR + (1 | SOURCE), family=binomial, data=wide)
>>> summary(f1)
>>>
>>> f2 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1 | SOURCE),
>>>            family=binomial, data=wide)
>>> summary(f2)
>>>
>>> f3 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1 + TELDUM + MAILDUM | SOURCE),
>>>            family=binomial, data=wide)
>>> summary(f3)
>>>
>>> f4 <- lmer(SUCCESS ~ RESPISRR + TELDUM + MAILDUM + (1 + TELDUM + MAILDUM | SOURCE),
>>>            family=binomial, data=long)
>>> summary(f4)
>>>
>>> -------------------------------
>>>
>>> Models f1, f2, and f4 work and reproduce the results of Hox. Model
>>> f4 takes a hell of a long time to compute, but it seems to give the
>>> expected results. Model f3, which I assumed (wrongly) would be the
>>> same as f4, does not work. Instead, when it is run, you get the
>>> error message:
>>>
>>>> Error in mer_finalize(ans) : q = 240 > n = 105
>>>
>>> I guess the question that I have now is: did I do something wrong
>>> with model f3 or is lmer doing something unusual? My assumption that
>>> models f3 and f4 were the same comes from MASS4 p190 where Ripley
>>> describes the glm function for logistic regression.
>>>
>>> I very much appreciate any insight.
>>>
>>> Brant
>>>
>>> #####################################################################################
>>>
>>>
>>> On Friday, March 20, 2009, at 04:32AM, "ONKELINX, Thierry"
>>> <Thierry.ONKELINX at inbo.be> wrote:
>>>> Dear Brant,
>>>>
>>>> The model is too complex. You have maximum three observations for
>>>> each level of the random effect. Allowing for a random intercept
>>>> and two random slopes does not make much sense then. Does it?
>>>>
>>>> HTH,
>>>>
>>>> Thierry
>>>>
>>>>
>>>> ----------------------------------------------------------------------------
>>>> ir. Thierry Onkelinx
>>>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
>>>> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
>>>> Gaverstraat 4, 9500 Geraardsbergen, Belgium
>>>> tel. + 32 54/436 185
>>>> Thierry.Onkelinx at inbo.be
>>>> www.inbo.be
>>>>
>>>> To call in the statistician after the experiment is done may be no
>>>> more than asking him to perform a post-mortem examination: he may
>>>> be able to say what the experiment died of. ~ Sir Ronald Aylmer
>>>> Fisher
>>>>
>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>>
>>>> The combination of some data and an aching desire for an answer
>>>> does not ensure that a reasonable answer can be extracted from a
>>>> given body of data. ~ John Tukey
>>>>
>>>> -----Original message-----
>>>> From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On behalf of Brant Inman
>>>> Sent: Thursday, 19 March 2009 3:11
>>>> To: r-sig-mixed-models at r-project.org
>>>> Subject: [R-sig-ME] Multilevel logistic regression
>>>>
>>>>
>>>> lmer Experts:
>>>>
>>>> I am trying to use lmer to duplicate the results found in Joop
>>>> Hox's book "Multilevel Analysis: Techniques and Applications" (2002).
>>>> In chapter 6 of his book he shows an example of multilevel logistic
>>>> regression for a meta-analysis of survey response rates. The data
>>>> are available in the file "metaresp.xls" at his website:
>>>>
>>>> <http://www.geocities.com/joophox/mlbook/excelxls.zip>
>>>>
>>>> The dataset includes the following variables of interest:
>>>>
>>>> Individual level (Level 1) variables:
>>>>   TELDUM   = telephone questioning
>>>>   MAILDUM  = mail questioning
>>>>   RESPONSE = the outcome of interest, the study response rate
>>>>   DENOM    = the number of people questioned
>>>>
>>>> Study/group level (Level 2) variables:
>>>>   SOURCE   = the study identifier
>>>>   YEAR     = year of study
>>>>   SALIENCY = how salient the questionnaire was (0 to 2)
>>>>   RESPISRR = the way the response rate was calculated
>>>>
>>>>
>>>> The null model (Table 6.2) proposed by Joop is easy to fit:
>>>>
>>>> SUCCESS <- as.integer(RESPISRR*DENOM)
>>>> y <- cbind(SUCCESS, DENOM-SUCCESS)
>>>>
>>>> f1 <- lmer(y ~ RESPISRR + (1 | SOURCE),
>>>>            family=binomial(link=logit))
>>>>
>>>>
>>>> Joop then adds a couple of Level 1 variables (Table 6.3):
>>>>
>>>> f2 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1 | SOURCE),
>>>>            family=binomial(link=logit))
>>>>
>>>>
>>>> He then says that these two Level 1 variables should be allowed to
>>>> vary across studies (varying slopes). When I try to fit what I
>>>> believe to be the correct model, I get an error
>>>>
>>>>
>>>> f3 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (TELDUM | SOURCE) +
>>>>            (MAILDUM | SOURCE) + (1 | SOURCE), family=binomial(link=logit))
>>>>
>>>> Error in mer_finalize(ans) : q = 240 > n = 105
>>>>
>>>>
>>>> Can anyone tell me what I am doing wrong here? Thanks so much in
>>>> advance.
>>>>
>>>> Brant Inman
>>>> Duke University Medical Center
>>>>
>>>
>>
>>
>>--
>>Ben Bolker
>>Associate professor, Biology Dep't, Univ. of Florida
>>bolker at ufl.edu / www.zoology.ufl.edu/bolker
>>GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc
>>
>>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>