[R-sig-ME] Improving residual distribution in glmmTMB
Timothy MacKenzie
|@w|@wt @end|ng |rom gm@||@com
Tue May 16 03:45:59 CEST 2023
Many thanks, I wasn't aware of this. Looking forward to one day trying
the same data with the family=Poisson-inverse-Gamma().
Much appreciated,
Tim M
On Mon, May 15, 2023 at 7:52 PM Ben Bolker <bbolker using gmail.com> wrote:
>
> By default (the re.form argument to predict() is set to NULL), the
> predictions include the random effects. So a 'newdata' data frame like:
>
> ELs Document District
> 20 1 NA
> 20 2 NA
> 20 3 NA
>
> would give you the predicted values for document 1, 2, 3, given ELs =
> 20, and making a population-level prediction for District (i.e. *not*
> using the District RE/setting this info to 0)
>
> On 2023-05-15 8:46 p.m., Timothy MacKenzie wrote:
> > I'm likely misunderstanding your recommendation for using predict()
> > for each Document. If our model has no fixed effect for Document, I'm
> > wondering how then we can predict() how often each document is
> > downloaded?
> >
> > Do you mean averaging across predictions for each category of Document as in:
> >
> > library(dplyr)
> > data.frame(d, fitted = predict(model, type = "response")) %>%
> > group_by(Document) %>% mutate(ave_downs = mean(downs))
> >
> > Tim M
> >
> > model <- glmmTMB(downs ~ I(ELs/20) +
> > ## Document +
> > (1|District)+
> > (1|Document),
> > family=truncated_nbinom2(),
> > data = d)
> >
> > On Mon, May 15, 2023 at 6:18 PM Ben Bolker <bbolker using gmail.com> wrote:
> >>
> >> In general it doesn't make sense to include any term as both a fixed
> >> effect and a random-effect grouping variable (with a few exceptions).
> >>
> >> There's no reason you can't examine the model predictions of how
> >> often each document is downloaded (i.e., use predict() with each
> >> document and whatever baseline covariate settings you want)
> >>
> >> On 2023-05-15 6:52 p.m., Timothy MacKenzie wrote:
> >>> Thank you, Ben, that's very helpful. I noticed that you removed the
> >>> fixed effect for "Document", that's, I'm guessing, to improve the
> >>> model fit (ex. avoiding singularity)? I was hoping to explore how many
> >>> times, on average, each unique Document was downloaded. Looks like
> >>> that may not be possible.
> >>>
> >>> Thanks,
> >>> Tim M
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, May 15, 2023 at 9:37 AM Ben Bolker <bbolker using gmail.com> wrote:
> >>>>
> >>>> I don't have any perfect solution, and I don't necessarily recommend
> >>>> the brute-force approach tried below, but a truncated negative binomial
> >>>> (2/quadratic parameterization) seems best. The genpois and nbinom1 are
> >>>> actually wonkiest, in terms of appearance of residuals.
> >>>> The COM-Poisson below is *very* slow, you might want to skip it (it
> >>>> took 20 minutes to fit on 16 cores, and the results are not that great
> >>>> anyway).
> >>>> I thought about implementing a Poisson-inverse-Gamma to allow for
> >>>> heavier tails, but haven't gotten around to it yet (it's not trivial ...)
> >>>> There seems to be something wonky about the reported sigma/Pearson
> >>>> residuals from truncated nbinom2, but the results are OK ...
> >>>>
> >>>> ### Reproducible data and code:
> >>>> d <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
> >>>> library(glmmTMB)
> >>>> library(purrr)
> >>>> library(broom.mixed)
> >>>> library(dplyr)
> >>>> library(DHARMa)
> >>>>
> >>>> model <- glmmTMB(downs ~ I(ELs/20) +
> >>>> ## Document +
> >>>> (1|District) +
> >>>> (1|Document),
> >>>> family=genpois(),
> >>>> # ziformula = ~ Document,
> >>>> # dispformula = ~ Document,
> >>>> data = d)
> >>>>
> >>>> ## very slow ...
> >>>> system.time(model2 <- update(model, family = "compois",
> >>>> control = glmmTMBControl(parallel = 16)))
> >>>>
> >>>> ## user system elapsed
> >>>> ## 17194.076 11.103 1134.367
> >>>>
> >>>> ## truncated genpois doesn't work for some reason ...
> >>>> fam <- c("genpois", "poisson", "nbinom1", "nbinom2")
> >>>> fam <- c(fam, paste0("truncated_", fam[-1]))
> >>>> mod_list <- lapply(fam,
> >>>> function(f) {
> >>>> cat(f, "\n")
> >>>> update(model, family = f)
> >>>> })
> >>>> names(mod_list) <- fam
> >>>> mod_list[["compois"]] <- model2
> >>>>
> >>>>
> >>>> aictab <- (map_dfr(mod_list, glance, .id = "family")
> >>>> |> select(-c(nobs,logLik, BIC))
> >>>> |> mutate(across(c(AIC, deviance), ~ . - min(., na.rm = TRUE)))
> >>>> |> arrange(AIC)
> >>>> )
> >>>>
> >>>> ## family sigma AIC df.residual deviance
> >>>> ## <chr> <dbl> <dbl> <int> <dbl>
> >>>> ## 1 truncated_nbinom2 3.66e-8 0 569 NA
> >>>> ## 2 nbinom2 1.39e+0 665. 569 0
> >>>> ## 3 compois 5.29e+9 682. 569 NA
> >>>> ## 4 genpois 1.33e+1 768. 569 NA
> >>>> ## 5 nbinom1 8.11e+0 981. 569 191.
> >>>> ## 6 truncated_poisson 1 e+0 1996. 570 NA
> >>>> ## 7 poisson 1 e+0 2553. 570 2077.
> >>>> ## 8 truncated_nbinom1 7.84e+0 NA 569 NA
> >>>>
> >>>> tnb <- mod_list[["truncated_nbinom2"]]
> >>>> ## Pearson resid calc messed up ???
> >>>> plot(resid(tnb,type="pearson") ~ fitted(tnb))
> >>>>
> >>>> plot(d$downs, predict(tnb, type = "response"), log = "xy")
> >>>> abline(a=0, b=1, col = 2)
> >>>>
> >>>> ## reorder
> >>>> mod_list <- mod_list[aictab$family]
> >>>> op <- par(mfrow = c(3, 3), mfrow = c(3,2,1,1), las = 1)
> >>>> lapply(mod_list,
> >>>> function(m) plot(resid(m,type="pearson") ~ fitted(m),
> >>>> xlab = "", ylab = "", main = family(m)$family)
> >>>> )
> >>>>
> >>>>
> >>>> ## DHARMa
> >>>> ss <- simulateResiduals(tnb)
> >>>> plot(ss)
> >>>> plotResiduals(ss, form = d$ELs/20)
> >>>>
> >>>>
> >>>>
> >>>> On 2023-05-13 11:55 p.m., Timothy MacKenzie wrote:
> >>>>> Greetings,
> >>>>>
> >>>>> I'm modeling the number of downloads (downs) of certain forms
> >>>>> (Document) by a number of school districts (District) of various
> >>>>> student sizes (ELs).
> >>>>>
> >>>>> I've tried genpois(), nbinom1(), and nbinom2() families but the fitted
> >>>>> vs. residual doesn't look promising.
> >>>>>
> >>>>> In glmmTMB, are there any additional ways to improve the residual
> >>>>> distribution for count-type data?
> >>>>>
> >>>>> Thank you,
> >>>>> Tim M
> >>>>> ### Reproducible data and code:
> >>>>> d = read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
> >>>>>
> >>>>> model = glmmTMB(downs ~ I(ELs/20) + Document +
> >>>>> (1|District) +
> >>>>> (1|Document),
> >>>>> family=genpois(),
> >>>>> # ziformula = ~ Document,
> >>>>> # dispformula = ~ Document,
> >>>>> data = d)
> >>>>>
> >>>>> plot(resid(model,type="pearson") ~ fitted(model))
> >>>>>
> >>>>> _______________________________________________
> >>>>> R-sig-mixed-models using r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>>>
> >>>> --
> >>>> Dr. Benjamin Bolker
> >>>> Professor, Mathematics & Statistics and Biology, McMaster University
> >>>> Director, School of Computational Science and Engineering
> >>>> (Acting) Graduate chair, Mathematics & Statistics
> >>>> > E-mail is sent at my convenience; I don't expect replies outside of
> >>>> working hours.
> >>>>
> >>>> _______________________________________________
> >>>> R-sig-mixed-models using r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>
> >> --
> >> Dr. Benjamin Bolker
> >> Professor, Mathematics & Statistics and Biology, McMaster University
> >> Director, School of Computational Science and Engineering
> >> (Acting) Graduate chair, Mathematics & Statistics
> >> > E-mail is sent at my convenience; I don't expect replies outside of
> >> working hours.
>
> --
> Dr. Benjamin Bolker
> Professor, Mathematics & Statistics and Biology, McMaster University
> Director, School of Computational Science and Engineering
> (Acting) Graduate chair, Mathematics & Statistics
> > E-mail is sent at my convenience; I don't expect replies outside of
> working hours.
More information about the R-sig-mixed-models
mailing list