[R] mgcv::gam() scale parameter estimates for quasibinomial error models

Thu Apr 15 10:56:31 CEST 2021

Thanks John. It's a bug in weights handling. mgcv will give wrong scale 
parameter estimates for weighted models where the scale parameter is 
unknown, (except Gaussian, fortunately). quasibinomial with trials > 1 
is one such case, because the weights are used to store the number of 
trials. Otherwise the problem would only arise if weights were 
explicitly provided in a non-Gaussian model with unknown scale 
parameter. Fixed for 1. 8-35. best,

Simon

On 15/04/2021 06:39, John Maindonald wrote:
> For both glm() and mgcv::gam() quasibinomial error models, the summary
> object has a dispersion value that has the same role as sigma^2 in the
> summary object for lm() model fits.
>
> Where some fitted probabilities are small, the `gam()` default scale parameter
> estimates, returned as `scale` (and `sig2`) in the gam object and as “dispersion"
> in the summary object, can differ wildly from the Pearson values that `glm()`
> works with, and that can be obtained by calling `gam()` with a suitable control
> setting (see the code that now follows.)
>
> The following demonstrates the issue:
>
>    ## ‘mgcv’ version 1.8-34
>    Cholera <- HistData:: Cholera
>    library(mgcv)
>    form <- cbind(cholera_deaths, popn-cholera_deaths) ~
>       water + elevation + poor_rate
>    default.gam <- gam(form, data=Cholera, family=quasibinomial)
>    pearson.gam <- update(quasibin.gam, control=gam.control(scale.est=“pearson"))
>
>    c(Pearson=pearson.gam$scale, Default=default.gam$scale)
>    ##  Pearson     Default
>    ## 33.545829  2.919535
>
> My own calculation (from either fitted model), returns 30.07 for the
> (Fletcher 2012) version of the dispersion that was, according to
> Wood’s “Generalized Additive Models” (2nd edn, 2017, p.111),
> returned as the GAM scale estimate at the time when the book was
> written.
>
> The default scale estimates returned by `gam()` vary wildly, relative
> to the relatively stable “pearson" estimates, in data that are simulated
> to have comparable dispersion estimates.
>
> For the Cholera data, it would make good sense to fit a model
> with quasipoisson errors to the death counts, using log(popn) as an
> offset.  The GAM model then uses, with default setting for `scale.est`,
> a scale parameter that is close to that returned by "pearson.gam”.
> SEs (as well as coefficients) are similar to those returned by
> "pearson.gam”.
>
> The detailed calculations that are documented in the following
> documents may be of some general interest.
>    https://www.dropbox.com/s/vl9usat07urbgel/quasibin-gam.pdf?dl=0
>    or https://www.dropbox.com/s/s83mh1mut5xc3gk/quasibin-gam.html?dl=0
>
> I am posting this here now before posting a bug report in case I have
> missed something important.  It would be useful to be directed to the
> mgcv code used for calculation of what is returned as the Fletcher statistic.
>
>    Rmd file: https://www.dropbox.com/s/ghmsdcvgxp068bs/quasibin-gam.Rmd?dl=0
>    .bib file: https://www.dropbox.com/s/r1yjqx0sni2pzjy/quasi.bib?dl=0
>
> John Maindonald             email: john.maindonald using anu.edu.au
>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Simon Wood, School of Mathematics, University of Edinburgh,
https://www.maths.ed.ac.uk/~swood34/