[R-sig-ME] glmer.nb warning and prediction problem

Sat Mar 5 21:13:16 CET 2022

On 3/5/22 6:29 AM, Adriaan de Jong wrote:
> Hi Ben,
> Thank you very much for your comments and suggestions.
> 
> As for point 2.
> Adding quotation marks didn't solve the problem. But from the lapply(model.frame(mod5), unique) printout (listed below) I understand that the Dec_Hour = 7.0 was not represented among the unique values, but 7.00 was. I changed the newd vector (newd<-data.frame(Years=c(1,29),Dag="34",Dec_Hour="7.00"), but this didn't solve the problem either. The Dag variable does include the value 34 so that couldn't cause the problem. So yes, an error message with more information wouldn't hurt :)
> 
>  From a biological point of view, the decimal-hours ("Dec_Hour") form of time-of-day doesn't make much more sense than using just full hours ("Hour") in the random effects part. When I changed "Dec_Hour" to "Hour" in the original glmer.nb, things went smoothly (at least after the iteration limit warning). The prediction function
>      (newd4<-data.frame(Years=c(1,29),Dag="34",Hour="7")
>      pred4<-predict(mod4,newd4)).
> also worked well. With and without levels in quotation marks!

  As far as I can tell your "hours_dec" is the decimal version of 
5-minute time periods (the values range from 5.00 to 20.92, with 
intervals of approximately 1/12).

   It's not obvious to me why using a resolution of hours rather than 
5-minute periods should make any difference: is there any chance you can 
provide a reproducible example, i.e. post your data somewhere (or if 
necessary send it to me)?

   At a slightly deeper level (that may not be critical for your 
application), I would be strongly tempted to treat time-of-day as a 
fixed covariate (with a linear or quadratic or possibly a spline-based 
effect of time over the course of the day; a cyclic spline would be 
useful if your data covered the full 24-hour span but I think it's 
unnecessary because the range is only (5-21), possibly in addition to an 
"hour nested within day" random effect to account for less-predictable 
temporal variation ...

> 
> As for point 1.
> This is great news, because negative binomial modelling has been troublesome for me. Being a non-professional statistician, it makes me wonder, though, when and how to apply the var/mean ratio (often seen as a "law" in family selection).

   The main point here is that the variance-mean ratio criterion (which 
is not a 'law', just a rule of thumb) applies to the *conditional 
distribution* (i.e., the distribution once the statistical model has 
been applied, controlling for grouping and covariates).  There are 
methods/rules of thumb for testing for overdispersion based on the 
residuals of a fitted model, e.g. see the GLMM FAQ at 
https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html

> 
> Never mind, I changed to a poisson glmer (glmer(Total~  Years + (1|Dag/Dec_Hour), family=poisson)) and (obviously) received no iteration limit warning, but again, the prediction function didn't work. Similar to the above, this problem was solved by changing from decimal to Hours in the random effects part.
> 
> 
>  From this I conclude that the use of Dec_Hour (decimal hour based on a combination of hour and minutes data) is inappropriate in both glmer and glmer.nb when predictions are wanted. Is this information provided anywhere in the documentation of these functions? From my point of view, the interesting thing is that Dec_Hour worked flawless in gam and the predict.gam functions.
> 
> This leaves me in a model selection situation. Based on AIC (list below), the glmer.nb with just full hours seems to be the model of choice, especially because it allows for predictions. Does this choice make sense to you?

   Yes (although if it were my problem I would probably want to dig a 
little deeper to understand the differences in goodness of 
fit/differences in conclusions across the different models/etc.).  (In 
any case, if you can provide a reproducible example so we can understand 
the cause of the prediction failure, that would be nice.)

> 
> Cheers,
> Adjan
> 
> Function and formulaAIC
> glmer(Total~  Years + (1|Dag/Hour), family=poisson)4973
> glmer(Total~  Years + (1|Dag/Dec_Hour), family=poisson)4920  (prediction failure)
> glmer.nb(Total~  Years + (1|Dag/Dec_Hour))4922  (iteration limit warning, prediction failure)
> glmer.nb(Total~  Years + (1|Dag/Hour))4914  (iterarion limit warning)
> 
> 
> 

> lapply(model.frame(mod5)
> $Total
>   [1]  9  3  7 14 10  8 12  5  1  2 11  6 17  4 19 15 13 16 18 21 20  0 22
> 
> $Years
>   [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 
> $Dag
>   [1] 33 41 42 47 48 53 59 62 66 67 68 72 77 78 79 31 32 37 38 40 43 46 54 57 61 70 35 50 51 52 56 63 65 75
> [35] 36 39 44 49 58 69 73 34 60 80 45 64 55 74 76 71
> 
> $Dec_Hour
>    [1] 11.42 11.17 15.42  5.33  7.92 15.92 16.92 17.17 10.92 20.42 17.42 14.92  8.42 11.67 16.42 12.17 16.17
>   [18] 17.33 10.17 14.17 17.67 15.67 19.50 13.42  9.50 16.25 10.50 12.50 11.75  6.42 19.08  8.17 14.00 16.00
>   [35] 20.17  7.67 11.08 17.25 11.00 12.42 10.00 13.00 17.83  6.58 13.17 15.08  7.83 14.42  8.00  9.58 14.50
>   [52]  9.25 12.25 18.83  6.67 16.67  8.33 12.00  9.67 12.33  7.50 14.67 15.17 20.92 11.25 19.00  9.00 11.92
>   [69] 12.83 17.50 18.00 18.67 12.58  8.67 20.67  8.50 17.92 13.67 18.25 18.17 11.50 16.33 11.83 13.08 15.50
>   [86] 19.17 18.50 10.75  9.83 13.83 14.75  8.58  9.08 10.83 11.33 15.25 17.75 12.92 18.58 19.67 13.50 13.92
> [103] 10.42 17.00 18.33  5.92  6.33  7.70 15.00 15.83 13.33  6.75  8.83 18.42 19.42 15.05 15.58  5.17 12.67
> [120] 13.75 10.33 12.08  6.83  9.17 10.58 20.58 20.75  6.92 16.83  5.42 12.75  8.92  5.67 14.08 18.08 16.58
> [137] 14.33  5.08  7.17  6.17  9.75 17.08 10.08  5.25  5.27 10.67 20.33 19.92 17.48  6.50 15.33 14.83 16.50
> [154] 20.00  9.92 16.75  5.75  5.83 20.83  7.00 19.33 17.58 18.75 20.25  5.00 19.58 15.75  7.25 16.08  7.75
> [171]  7.33  7.42  5.50 20.08  8.25  8.08  6.00  7.58 13.58 18.92 14.58 14.25 19.75  6.08 19.83  7.08 10.25
> [188]  9.42  9.33  6.38 13.25  6.25 19.25  8.75
> 
> 
> 
> -----Original Message-----
> From: Ben Bolker <bbolker using gmail.com>
> Sent: den 4 mars 2022 21:02
> To: Adriaan de Jong <Adriaan.de.Jong using slu.se>
> Cc: r-sig-mixed-models using r-project.org
> Subject: Re: [R-sig-ME] glmer.nb warning and prediction problem
> 
>    (It wouldn't hurt for us to make the error message slightly more informative by listing *which* new levels were found ...)
> 
> On Fri, Mar 4, 2022 at 3:02 PM Ben Bolker <bbolker using gmail.com> wrote:
>>
>>    (1) tl;dr don't worry about it. It's very, very likely that you have
>> data whose conditional distribution is indistinguishable from a
>> Poisson. I know that the *marginal* variance/mean ratio is large, but
>> once we put in the covariates and the random effects, what's left is
>> presumably equi- or underdispersed (variance ~ mean or variance <
>> mean). The mean counts at the beginning of the time series are about 8
>> (exp(2)), decreasing over time; your dispersion parameter is 798, >>
>> mean - that means the distribution is effectively Poisson. The
>> iteration limit warning comes from the initial call to MASS::glm.nb,
>> which tries to determine an initial guess for the dispersion parameter
>> via an iterative algorithm - it gives up after a while.
>>     You could try a Poisson model - my guess is that it will give
>> practically indistinguishable results (and a likelihood ratio test
>> probably won't reject the null hypothesis that the conditional
>> distribution is Poisson).
>>     Or you could ignore the warning.
>>
>> 2. You might need to put the levels of `Dag` and `Dec_Hour` in
>> quotation marks.  What is `lapply(model.frame(mod5), unique)` ?
>>
>>
>> On Fri, Mar 4, 2022 at 7:02 AM Adriaan de Jong <Adriaan.de.Jong using slu.se> wrote:
>>>
>>> Dear list members,
>>>
>>> What I’m trying to do is to model a linear trend over the years (Years = Year – 1992, 1993 was the starting year of the 29 year data series) of bird numbers (“Total”) with a var/mean ratio of 2.85. The variables “Dag” (= day of the study period starting on April 1 = 1, in this subset ranging from 31 to 80) and “Dec_Hour” (= decimal hour between 4 AM and 9 PM) are nested random effects. The subset for this analysis (Trend1) contains 993 rows (=counting trips).
>>>
>>> The mod5<-glmer.nb(Total~  Years + (1|Dag/Dec_Hour),data=Trend1) command under lme4 gives me the following warning message:
>>>
>>> In theta.ml(Y, mu, weights = object using resp$weights, limit = limit,  :
>>>    iteration limit reached
>>>
>>> I presume I need to provide glmer.nb with a lmerControl string, but have no clue what it should contain.
>>>
>>> I guess that the Dec_Hour may be causing the iteration limit
>>> problem. The model seems to converge, though, and the summary looks
>>> like this
>>>
>>>
>>> Generalized linear mixed model fit by maximum likelihood (Laplace
>>> Approximation) [ glmerMod]
>>>   Family: Negative Binomial(798.6507)  ( log )
>>> Formula: Total ~ Years + (1 | Dag/Dec_Hour)
>>>     Data: Trend1
>>>
>>>       AIC      BIC   logLik deviance df.resid
>>>    4921.7   4946.2  -2455.8   4911.7      988
>>>
>>> Scaled residuals:
>>>      Min      1Q  Median      3Q     Max
>>> -1.9784 -0.6199 -0.0885  0.4511  3.8851
>>>
>>> Random effects:
>>>   Groups       Name        Variance Std.Dev.
>>>   Dec_Hour:Dag (Intercept) 0.15182  0.3896
>>>   Dag          (Intercept) 0.09292  0.3048
>>> Number of obs: 993, groups:  Dec_Hour:Dag, 945; Dag, 50
>>>
>>> Fixed effects:
>>>              Estimate Std. Error z value Pr(>|z|)
>>> (Intercept)  2.09539    0.06025   34.78   <2e-16 ***
>>> Years       -0.04078    0.00248  -16.44   <2e-16 ***
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> Correlation of Fixed Effects:
>>>        (Intr)
>>> Years -0.594
>>>
>>> This makes sense to me, but I would like to predict the 1993 and 2021 endpoint "Total" values of the yearly trend (numerically Years = 1 and 29) given the peak values of observed birds found to be Dag=34 and Dec_Hour=7. For this I tried:
>>>
>>> newd<-data.frame(Years=c(1,29),Dag=34,Dec_Hour=7.0)
>>> pred<-predict(mod5,newd)
>>>
>>> This rendered the message:
>>> Error in levelfun(r, n, allow.new.levels = allow.new.levels) :
>>>    new levels detected in newdata
>>>
>>> Could anyone give me a hint on (a) how to avoid the iteration problem and (b) how to adjust the predict function? Thanks in advance for your help.
>>>
>>> Cheers,
>>> Adjan
>>>
>>> Adriaan "Adjan" de Jong
>>> Associate professor
>>> Swedish University of Agricultural Sciences
>>> ---
>>> När du skickar e-post till SLU så innebär detta att SLU behandlar
>>> dina personuppgifter. För att läsa mer om hur detta går till, klicka
>>> här <https://www.slu.se/om-slu/kontakta-slu/personuppgifter/>
>>> E-mailing SLU will result in SLU processing your personal data. For
>>> more information on how this is done, click here
>>> <https://www.slu.se/en/about-slu/contact-slu/personal-data/>
>>> _______________________________________________
>>> R-sig-mixed-models using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> ---
> När du skickar e-post till SLU så innebär detta att SLU behandlar dina personuppgifter. För att läsa mer om hur detta går till, klicka här <https://www.slu.se/om-slu/kontakta-slu/personuppgifter/>
> E-mailing SLU will result in SLU processing your personal data. For more information on how this is done, click here <https://www.slu.se/en/about-slu/contact-slu/personal-data/>

-- 
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics