[R-sig-ME] Multilevel logistic regression
Douglas Bates
bates at stat.wisc.edu
Mon Mar 23 17:45:38 CET 2009
On Mon, Mar 23, 2009 at 10:05 AM, Brant Inman <brant.inman at me.com> wrote:
> Thanks for everyone's explanations. I truly appreciate it. Since I have never really delved into S4 objects like lmer, I now have a reason to crack open Chambers' green book and figure out how to track lmer down to its depths to modify the error message line.
Actually you don't need to learn about S4 classes and methods to be
able to bypass this error message. If you search in the source code
for a string that begins with "q = " you will find that the error
message is generated in the C function called update_u in the
lme4/src/lmer.c file.
> This thread is why I think R is so great: as long as your question is not too terribly dumb, you get advice from the experts!
>
> Cheers,
>
> Brant
>
> -----------------------------------------
>
> On Sunday, March 22, 2009, at 12:41PM, "Douglas Bates" <bates at stat.wisc.edu> wrote:
>>On Sun, Mar 22, 2009 at 9:22 AM, Brant Inman <brant.inman at me.com> wrote:
>>> Ben,
>>
>>> Thanks for confirming my findings. Two questions for you:
>>
>>> 1) Is it normal that lmer does NOT find the following 3 formulae to be equivalent?
>>
>>> f1 <- lmer( cbind(success, fail) ~ covariates + (1 | group), data=wide ...
>>> f2 <- lmer( (success/(success+fail)) ~ covariates + (1 | group), weights=(success+fail), data=wide ...
>>> f3 <- lmer( success ~ covariates + (1 | group), data=long ...
>>
>>The code in lmer to count the "number of observations" is too
>>simplistic. It uses the length of the vector of responses, which is
>>appropriate except for the case of models like f1. What should be
>>done in that case is to evaluate the sum of the number of cases
>>represented by each row of the data.
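The two counts differ exactly when rows aggregate several trials. A minimal sketch of the distinction (Python for concreteness; the rows are invented and this is not lme4's actual implementation):

```python
# Wide-format binomial data: each row aggregates many Bernoulli trials.
rows = [
    {"success": 30, "fail": 10},
    {"success": 12, "fail": 8},
    {"success": 5, "fail": 15},
]

# What the current check counts: the length of the response vector.
n_rows = len(rows)

# What a cbind(success, fail) model actually represents: the total
# number of Bernoulli trials across all rows.
n_trials = sum(r["success"] + r["fail"] for r in rows)

print(n_rows, n_trials)  # 3 80
```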
>>
>>Tracking such things down is not terribly easy. The "initialize"
>>expression from a glm family is "not unlike" a gross hack. It is an
>>expression (not a function) that is evaluated for the sole purpose of
>>scattering a bunch of objects around the function evaluation
>>environment, which often ends up looking like a teenager's bedroom
>>afterwards.
>>
>>Yes, special purpose code could be written to detect a situation like
>>model f1 and account for it appropriately. Doing so will mean keeping
>>track of two, possibly different, definitions of the "number of
>>observations". If someone is willing to produce a patch for the code
>>I will consider it.
>>
>>I don't think it will be as easy to detect that model f2 represents
>>more observations than are recorded in the data. All you know is that
>>you are given observation weights, which could be the result of
>>effects other than multiple "observations" per observation.
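Still, it is worth recording why the weighted-proportion form can represent the same model at all: per row, its log-likelihood kernel matches the aggregated binomial one up to a constant that does not involve the fitted probability. A quick numeric check (plain Python, illustrative numbers):

```python
from math import comb, log

# One wide-format row: k successes out of m trials, candidate probability p.
k, m, p = 30, 40, 0.7

# cbind(success, fail) form: binomial log-likelihood.
ll_binom = log(comb(m, k)) + k * log(p) + (m - k) * log(1 - p)

# success/(success+fail) form with weights = m: weighted Bernoulli kernel.
y = k / m
ll_weighted = m * (y * log(p) + (1 - y) * log(1 - p))

# The difference is log C(m, k), a constant in p, so both forms have the
# same maximizer; the fully expanded long format shares the same kernel too.
print(abs((ll_binom - ll_weighted) - log(comb(m, k))) < 1e-9)  # True
```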
>>
>>> 2) How did you hack lmer to avoid this problem and did it compute the result faster with the wide dataset than the long one?
>>
>>Well, you start by grepping the source code to determine where the
>>error message is produced, then you disable that part of the code.
>>
>>The error message is there for a reason - primarily related to linear
>>mixed models. Sometimes people get carried away and define random
>>effects with respect to a factor that has a distinct level for each
>>observation. You can estimate the parameters in such a model but not
>>uniquely. The variance due to such a "one level per observation"
>>factor is completely confounded with the variance of the
>>"per-observation" noise term. People have reported results from such
>>models fit by lme so I wanted to protect against that by flagging such
>>models in lme4. One could adopt a "caveat emptor" approach and say
>>that the user is responsible for ensuring that the model that they
>>fit is mathematically reasonable, so if they get nonsense results for a
>>nonsense model it's their fault. Unfortunately, that ends up
>>reflecting badly on lme4 or the R project. People fail to distinguish
>>between a package in R and R in general so this type of problem is
>>reported as a flaw in R, often by people who want to convince
>>companies to pony up large amounts of money for commercial software
>>licenses.
>>
>>Hence, I have opted for the more conservative approach of throwing an
>>error in such cases. As Ben has pointed out, lme4 is open source so
>>if you want to modify it to fit a particular type of model that
>>currently produces an error, you have the opportunity of doing so.
>>
>>
>>> ----------------------------------------------------------------------------------------------------------------------------
>>>
>>> On Saturday, March 21, 2009, at 11:52PM, "Ben Bolker" <bolker at ufl.edu> wrote:
>>>> I tried this out with a hacked version of lme4 that removes
>>>>the test on the number of observations, and get reasonable answers
>>>>for f3, to wit:
>>>>
>>>>Random effects:
>>>> Groups Name Variance Std.Dev. Corr
>>>> SOURCE (Intercept) 0.78845 0.88795
>>>> TELDUM 0.24216 0.49210 -0.281
>>>> MAILDUM 0.52215 0.72260 -0.365 0.317
>>>>Number of obs: 105, groups: SOURCE, 48
>>>>
>>>>Fixed effects:
>>>> Estimate Std. Error z value Pr(>|z|)
>>>>(Intercept) 1.12762 0.19519 5.777 7.6e-09 ***
>>>>RESPISRR 0.20714 0.21497 0.964 0.335258
>>>>TELDUM -0.20251 0.09141 -2.215 0.026737 *
>>>>MAILDUM -0.56094 0.14708 -3.814 0.000137 ***
>>>>---
>>>>
>>>> The random effects (esp for telephone/mail) are a little different.
>>>>It doesn't look like the software used in the chapter estimates
>>>>correlations among random effects?
>>>>
>>>>## Table 6.4 Models for response rates in different conditions
>>>>## Fixed part conditions fixed conditions random
>>>>## coeff. (s.e.) coeff. (s.e.)
>>>>## Predictor
>>>>## intercept 0.90 (.14) 1.17 (.21)
>>>>## resptype 0.53 (.06) 0.20 (.23)
>>>>## telephone -0.16 (.02) -0.20 (.10)
>>>>## mail -0.49 (.03) -0.58 (.16)
>>>>## Random part
>>>>## intercept1 1.00 1.00
>>>>## intercept2 0.86 (.18) 0.87 (.20)
>>>>## telephone 0.26 (.08)
>>>>## mail 0.59 (.20)
>>>>
>>>>Eliminating the correlation among random effects:
>>>>
>>>>f3B <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1|SOURCE) +
>>>>(0+TELDUM|SOURCE) + (0+MAILDUM|SOURCE),
>>>> family=binomial, data=wide)
>>>>summary(f3B)
>>>>
>>>>Random effects:
>>>> Groups Name Variance Std.Dev.
>>>> SOURCE (Intercept) 0.72203 0.84972
>>>> SOURCE TELDUM 0.23582 0.48562
>>>> SOURCE MAILDUM 0.43923 0.66274
>>>>Number of obs: 105, groups: SOURCE, 48
>>>>
>>>>
>>>> telephone/mail random effects still estimated higher.
>>>>Tried nAGQ=6 (crank up Gauss-Hermite quadrature):
>>>>
>>>>Random effects:
>>>> Groups Name Variance Std.Dev.
>>>> SOURCE (Intercept) 0.74275 0.86183
>>>> SOURCE TELDUM 0.24565 0.49563
>>>> SOURCE MAILDUM 0.28567 0.53448
>>>>Number of obs: 105, groups: SOURCE, 48
>>>>
>>>>Brings mail RE down (*below* PQL est.) but telephone RE is still high.
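For readers wondering what raising nAGQ does: each group's likelihood contribution is an integral over its random effect, approximated by Gauss-Hermite quadrature, and more nodes give a better approximation. A toy version in Python (a plain, not the adaptive mode-centred, rule that lme4 uses; all numbers are invented):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def gh_mean(f, n_nodes, sd):
    # E[f(Z)] for Z ~ N(0, sd^2) via Gauss-Hermite quadrature:
    # int f(z) phi(z; 0, sd) dz = (1/sqrt(pi)) * sum_i w_i f(sqrt(2)*sd*x_i)
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return float(np.sum(w * f(np.sqrt(2.0) * sd * x)) / np.sqrt(np.pi))

# Marginal success probability for a logit model with intercept 1 and a
# random effect with sd 1.5.  With one node the rule just evaluates the
# integrand at z = 0, i.e. expit(1); more nodes refine the answer.
for n_nodes in (1, 2, 6, 20):
    print(n_nodes, gh_mean(lambda z: expit(1.0 + z), n_nodes, sd=1.5))
```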
>>>>
>>>>
>>>>
>>>>
>>>>Brant Inman wrote:
>>>>> Thierry et al:
>>>>>
>>>>> I think I solved my problem and it provides some insight into the way
>>>>> lmer handles binomial data, so I will share the findings here. First,
>>>>> I have made two datasets available on the web, a long format and a
>>>>> wide format version of the "metaresp" data of Joop Hox. They can be
>>>>> found here
>>>>>
>>>>> http://www.duke.edu/~bi6/
>>>>>
>>>>> Hox has a draft of the chapter of interest that discusses the
>>>>> metaresp dataset and the modeling process/problem that I am trying to
>>>>> solve. Note that the results that I am trying to reproduce are in
>>>>> Tables 6.3 and 6.4. The chapter can be found at:
>>>>>
>>>>> http://www.geocities.com/joophox/papers/chap6.pdf
>>>>>
>>>>> Now here are the models that I fit with lmer. Assume that the wide
>>>>> version of the data is called "wide" and the long version "long".
>>>>> -----------------------------
>>>>>
>>>>> y <- cbind(wide$SUCCESS, wide$FAIL)
>>>>>
>>>>> f1 <- lmer(y ~ RESPISRR + (1 | SOURCE), family=binomial, data=wide)
>>>>> summary(f1)
>>>>>
>>>>> f2 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM + (1 | SOURCE),
>>>>>            family=binomial, data=wide)
>>>>> summary(f2)
>>>>>
>>>>> f3 <- lmer(y ~ RESPISRR + TELDUM + MAILDUM +
>>>>>            (1 + TELDUM + MAILDUM | SOURCE),
>>>>>            family=binomial, data=wide)
>>>>> summary(f3)
>>>>>
>>>>> f4 <- lmer(SUCCESS ~ RESPISRR + TELDUM + MAILDUM +
>>>>>            (1 + TELDUM + MAILDUM | SOURCE),
>>>>>            family=binomial, data=long)
>>>>> summary(f4)
>>>>>
>>>>> -------------------------------
>>>>>
>>>>> Models f1, f2, and f4 work and reproduce the results of Hox. Model
>>>>> f4 takes a hell of a long time to compute, but it seems to give the
>>>>> expected results. Model f3, which I assumed (wrongly) would be the
>>>>> same as f4, does not work. Instead, when it is run, you get the
>>>>> error message:
>>>>>
>>>>>> Error in mer_finalize(ans) : q = 240 > n = 105
>>>>>
>>>>> I guess the question that I have now is: did I do something wrong
>>>>> with model f3 or is lmer doing something unusual? My assumption that
>>>>> models f3 and f4 were the same comes from MASS4 (p. 190), where Ripley
>>>>> describes the glm function for logistic regression.
>>>>>
>>>>> I very much appreciate any insight.
>>>>>
>>>>> Brant
>>>>>
>>>>> #####################################################################################
>>>>>
>>>>>
>>>>> On Friday, March 20, 2009, at 04:32AM, "ONKELINX, Thierry"
>>>>> <Thierry.ONKELINX at inbo.be> wrote:
>>>>>> Dear Brant,
>>>>>>
>>>>>> The model is too complex. You have at most three observations for
>>>>>> each level of the random effect. Allowing for a random intercept
>>>>>> and two random slopes then does not make much sense, does it?
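Thierry's point in numbers (counts taken from the model summaries elsewhere in the thread):

```python
# 105 observations spread over 48 levels of SOURCE ...
n_obs, n_groups = 105, 48
print(n_obs / n_groups)  # 2.1875 observations per SOURCE on average

# ... versus three random effects (one intercept, two slopes) to be
# predicted for every group.
print(n_groups * 3)  # 144 random-effects values to predict
```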
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Thierry
>>>>>>
>>>>>>
>>>>>> ----------------------------------------------------------------------------
>>>>>> ir. Thierry Onkelinx
>>>>>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
>>>>>> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
>>>>>> Gaverstraat 4, 9500 Geraardsbergen, Belgium
>>>>>> tel. +32 54/436 185
>>>>>> Thierry.Onkelinx at inbo.be
>>>>>> www.inbo.be
>>>>>>
>>>>>> To call in the statistician after the experiment is done may be no
>>>>>> more than asking him to perform a post-mortem examination: he may
>>>>>> be able to say what the experiment died of. ~ Sir Ronald Aylmer
>>>>>> Fisher
>>>>>>
>>>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>>>>
>>>>>> The combination of some data and an aching desire for an answer
>>>>>> does not ensure that a reasonable answer can be extracted from a
>>>>>> given body of data. ~ John Tukey
>>>>>>
>>>>>> -----Original message-----
>>>>>> From: r-sig-mixed-models-bounces at r-project.org
>>>>>> [mailto:r-sig-mixed-models-bounces at r-project.org] On behalf of Brant Inman
>>>>>> Sent: Thursday, 19 March 2009 3:11
>>>>>> To: r-sig-mixed-models at r-project.org
>>>>>> Subject: [R-sig-ME] Multilevel logistic regression
>>>>>>
>>>>>>
>>>>>> lmer Experts:
>>>>>>
>>>>>> I am trying to use lmer to duplicate the results found in Joop
>>>>>> Hox's book "Multilevel Analysis: technique and applications" 2002.
>>>>>> In chapter 6 of his book he shows an example of multilevel logistic
>>>>>> regression for a meta-analysis of survey response rates. The data
>>>>>> are available in the file "metaresp.xls" at his website:
>>>>>>
>>>>>> <http://www.geocities.com/joophox/mlbook/excelxls.zip>
>>>>>>
>>>>>> The dataset includes the following variables of interest:
>>>>>>
>>>>>> Individual level (Level 1) variables:
>>>>>> TELDUM = telephone questioning
>>>>>> MAILDUM = mail questioning
>>>>>> RESPONSE = the outcome of interest, the study response rate
>>>>>> DENOM = the number of people questioned
>>>>>>
>>>>>> Study/group level (Level 2) variables:
>>>>>> SOURCE = the study identifier
>>>>>> YEAR = year of study
>>>>>> SALIENCY = how salient the questionnaire was (0 to 2)
>>>>>> RESPISRR = the way the response rate was calculated
>>>>>>
>>>>>>
>>>>>> The null model (Table 6.2) proposed by Joop is easy to fit:
>>>>>>
>>>>>> SUCCESS <- as.integer(RESPISRR*DENOM)
>>>>>> y <- cbind(SUCCESS, DENOM-SUCCESS)
>>>>>>
>>>>>> f1 <- lmer(y ~ RESPISRR + (1 | SOURCE),
>>>>>> family=binomial(link=logit))
>>>>>>
>>>>>>
>>>>>> Joop then adds a couple Level 1 variables (Table 6.3):
>>>>>>
>>>>>> f2 <- lmer(y ~ RESPISRR + TELNUM + MAILDUM + (1 | SOURCE),
>>>>>> family=binomial(link=logit))
>>>>>>
>>>>>>
>>>>>> He then says that these two Level 1 variables should be allowed to
>>>>>> vary across studies (varying slopes). When I try to fit what I
>>>>>> believe to be the correct model, I get an error
>>>>>>
>>>>>>
>>>>>> f3 <- lmer(y ~ RESPISRR + TELNUM + MAILDUM + (TELNUM | SOURCE) +
>>>>>> (MAILDUM | SOURCE) + (1 | SOURCE), family=binomial(link=logit))
>>>>>>
>>>>>> Error in mer_finalize(ans) : q = 240 > n = 105
>>>>>>
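Where the q and n in that error come from, on my reading of the check (back-of-the-envelope only, not lme4 code):

```python
# q counts the random effects the model must predict; n counts the rows.
n = 105          # observations in the data
n_groups = 48    # levels of SOURCE

# (TELNUM | SOURCE) and (MAILDUM | SOURCE) each contribute a random
# intercept and a random slope per group; (1 | SOURCE) adds one more
# random intercept per group: 2 + 2 + 1 = 5 effects per group.
q = n_groups * (2 + 2 + 1)

print(q)  # 240
# q > n, hence "Error in mer_finalize(ans) : q = 240 > n = 105"
```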
>>>>>>
>>>>>> Can anyone tell me what I am doing wrong here? Thanks so much in
>>>>>> advance.
>>>>>>
>>>>>> Brant Inman Duke University Medical Center
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> R-sig-mixed-models at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>>>
>>>>>> The views expressed in this message and any annex are purely those
>>>>>> of the writer and may not be regarded as stating an official
>>>>>> position of INBO, as long as the message is not confirmed by a duly
>>>>>> signed document.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>--
>>>>Ben Bolker
>>>>Associate professor, Biology Dep't, Univ. of Florida
>>>>bolker at ufl.edu / www.zoology.ufl.edu/bolker
>>>>GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc
>>>>
>>>>
>>>
>>>
>>
>>
>