[R-sig-ME] Improving residual distribution in glmmTMB

Ben Bolker bbolker at gmail.com
Tue May 16 02:52:00 CEST 2023


   By default (the re.form argument to predict() is set to NULL), the 
predictions include the random effects.  So a 'newdata' data frame like:

ELs   Document  District
20    1         NA
20    2         NA
20    3         NA

would give you the predicted values for Documents 1, 2, and 3 at
ELs = 20, while making a population-level prediction for District
(i.e. *not* using the District RE; that contribution is set to 0).
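
   For example, a minimal sketch of that call (assuming the fitted model 
object is called "model", as elsewhere in the thread, and that Document is 
coded the same way as in the little data frame above):

nd <- data.frame(ELs = 20,
                 Document = c(1, 2, 3),
                 District = NA)  ## NA -> population-level (District RE set to 0)
predict(model, newdata = nd, type = "response")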

On 2023-05-15 8:46 p.m., Timothy MacKenzie wrote:
> I'm likely misunderstanding your recommendation for using predict()
> for each Document. If our model has no fixed effect for Document, I'm
> wondering how we can then use predict() to estimate how often each
> document is downloaded?
> 
> Do you mean averaging across predictions for each category of Document as in:
> 
> library(dplyr)
> data.frame(d, fitted = predict(model, type = "response")) %>%
>    group_by(Document) %>% summarise(ave_downs = mean(fitted))
> 
> Tim M
> 
> model <- glmmTMB(downs ~ I(ELs/20) +
>                      ## Document +
>                      (1|District) +
>                      (1|Document),
>                  family = truncated_nbinom2(),
>                  data = d)
> 
> On Mon, May 15, 2023 at 6:18 PM Ben Bolker <bbolker at gmail.com> wrote:
>>
>>     In general it doesn't make sense to include any term as both a fixed
>> effect and a random-effect grouping variable (with a few exceptions).
>>
>>     There's no reason you can't examine the model predictions of how
>> often each document is downloaded (i.e., use predict() with each
>> document and whatever baseline covariate settings you want).
>>
>> On 2023-05-15 6:52 p.m., Timothy MacKenzie wrote:
>>> Thank you, Ben, that's very helpful. I noticed that you removed the
>>> fixed effect for "Document"; I'm guessing that's to improve the
>>> model fit (e.g. avoiding singularity)? I was hoping to explore how many
>>> times, on average, each unique Document was downloaded. Looks like
>>> that may not be possible.
>>>
>>> Thanks,
>>> Tim M
>>>
>>>
>>>
>>>
>>>
>>> On Mon, May 15, 2023 at 9:37 AM Ben Bolker <bbolker at gmail.com> wrote:
>>>>
>>>>      I don't have any perfect solution, and I don't necessarily recommend
>>>> the brute-force approach tried below, but a truncated negative binomial
>>>> (2/quadratic parameterization) seems best. The genpois and nbinom1 fits
>>>> are actually the wonkiest in terms of the appearance of their residuals.
>>>>      The COM-Poisson below is *very* slow; you might want to skip it (it
>>>> took 20 minutes to fit on 16 cores, and the results are not that great
>>>> anyway).
>>>>      I thought about implementing a Poisson-inverse-Gamma to allow for
>>>> heavier tails, but haven't gotten around to it yet (it's not trivial ...)
>>>>      There seems to be something wonky about the reported sigma/Pearson
>>>> residuals from truncated nbinom2, but the results are OK ...
>>>>
>>>> ### Reproducible data and code:
>>>> d <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
>>>> library(glmmTMB)
>>>> library(purrr)
>>>> library(broom.mixed)
>>>> library(dplyr)
>>>> library(DHARMa)
>>>>
>>>> model <- glmmTMB(downs ~ I(ELs/20) +
>>>>                      ## Document +
>>>>                      (1|District) +
>>>>                      (1|Document),
>>>>                  family = genpois(),
>>>>                  ## ziformula = ~ Document,
>>>>                  ## dispformula = ~ Document,
>>>>                  data = d)
>>>>
>>>> ## very slow ...
>>>> system.time(model2 <- update(model, family = "compois",
>>>>                                 control = glmmTMBControl(parallel = 16)))
>>>>
>>>> ##      user    system   elapsed
>>>> ## 17194.076    11.103  1134.367
>>>>
>>>> ## truncated genpois doesn't work for some reason ...
>>>> fam <- c("genpois", "poisson", "nbinom1", "nbinom2")
>>>> fam <- c(fam, paste0("truncated_", fam[-1]))
>>>> mod_list <- lapply(fam,
>>>>                       function(f) {
>>>>                           cat(f, "\n")
>>>>                           update(model, family = f)
>>>> })
>>>> names(mod_list) <- fam
>>>> mod_list[["compois"]] <- model2
>>>>
>>>>
>>>> aictab <- (map_dfr(mod_list, glance, .id = "family")
>>>>        |> select(-c(nobs,logLik, BIC))
>>>>        |> mutate(across(c(AIC, deviance), ~ . - min(., na.rm = TRUE)))
>>>>        |> arrange(AIC)
>>>> )
>>>>
>>>> ##   family              sigma   AIC df.residual deviance
>>>> ##   <chr>               <dbl> <dbl>       <int>    <dbl>
>>>> ## 1 truncated_nbinom2 3.66e-8    0          569      NA
>>>> ## 2 nbinom2           1.39e+0  665.         569       0
>>>> ## 3 compois           5.29e+9  682.         569      NA
>>>> ## 4 genpois           1.33e+1  768.         569      NA
>>>> ## 5 nbinom1           8.11e+0  981.         569     191.
>>>> ## 6 truncated_poisson 1   e+0 1996.         570      NA
>>>> ## 7 poisson           1   e+0 2553.         570    2077.
>>>> ## 8 truncated_nbinom1 7.84e+0   NA          569      NA
>>>>
>>>> tnb <- mod_list[["truncated_nbinom2"]]
>>>> ## Pearson resid calc messed up ???
>>>> plot(resid(tnb,type="pearson") ~ fitted(tnb))
>>>>
>>>> plot(d$downs, predict(tnb, type = "response"), log = "xy")
>>>> abline(a=0, b=1, col = 2)
>>>>
>>>> ## reorder
>>>> mod_list <- mod_list[aictab$family]
>>>> op <- par(mfrow = c(3, 3), mar = c(3, 2, 1, 1), las = 1)
>>>> lapply(mod_list,
>>>>           function(m) plot(resid(m,type="pearson") ~ fitted(m),
>>>>                            xlab = "", ylab = "", main = family(m)$family)
>>>>           )
>>>>
>>>>
>>>> ## DHARMa
>>>> ss <- simulateResiduals(tnb)
>>>> plot(ss)
>>>> plotResiduals(ss, form = d$ELs/20)
>>>>
>>>>
>>>>
>>>> On 2023-05-13 11:55 p.m., Timothy MacKenzie wrote:
>>>>> Greetings,
>>>>>
>>>>> I'm modeling the number of downloads (downs) of certain forms
>>>>> (Document) by a number of school districts (District) of various
>>>>> student sizes (ELs).
>>>>>
>>>>> I've tried the genpois(), nbinom1(), and nbinom2() families, but the
>>>>> fitted-vs.-residual plot doesn't look promising.
>>>>>
>>>>> In glmmTMB, are there any additional ways to improve the residual
>>>>> distribution for count-type data?
>>>>>
>>>>> Thank you,
>>>>> Tim M
>>>>> ### Reproducible data and code:
>>>>> d = read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
>>>>>
>>>>> model = glmmTMB(downs ~ I(ELs/20) + Document +
>>>>>                     (1|District) +
>>>>>                     (1|Document),
>>>>>                 family = genpois(),
>>>>>                 ## ziformula = ~ Document,
>>>>>                 ## dispformula = ~ Document,
>>>>>                 data = d)
>>>>>
>>>>> plot(resid(model,type="pearson") ~ fitted(model))
>>>>>
>>

-- 
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
 > E-mail is sent at my convenience; I don't expect replies outside of 
working hours.


