[R] mgcv: distribution of dev with Poisson data

Greg Dropkin gregd at gn.apc.org
Wed Feb 5 18:50:41 CET 2014


Dear Prof Ripley

yes, but if the estimate is biased it's good to know what the bias is.

The problem illustrated in the simulations has nothing to do with ML,
though, as the default fitting method in mgcv when scale is unknown is
"GCV" and that is what was used, by default, here.

The point about ML was that I'd looked separately at what happens when
scale varies in fitting a quasipoisson model, and found that ML was
minimised (in an example) at the scale estimate (i.e. the default estimate
using dev). A very brief look at the code made that seem plausible as it
does involve the derivatives with respect to scale.

In my limited experience ML is actually better behaved than GCV or REML,
giving more conservative estimates and tighter CIs which behave better
under simulation.

re the Pearson estimates being preferable, that is also what Simon Wood's
book says on p172, another reason why I was puzzled that the default
estimate used dev instead.

thanks

Greg

> On 05/02/2014 12:56, Greg Dropkin wrote:
>> thanks Simon
>> also, it appears at least with ML that the default scale estimate with
quasipoisson (i.e. using dev) is the scale which minimises the ML value
of
>> the fitted model. So it is the "best" model but doesn't actually give the
>> correct mean-variance relation. Is that right?
>
> Well, estimates do not usually give the correct value ....
>
> But ML is biased in many cases, sometimes badly so here, and most people
prefer the Pearson estimator of the dispersion.  This is McCullagn &
Nelder's book, even the first edition, but without much evidence:
Venables & Ripley (2002, §7.5) has plots to back up their hunch.
>
>> thanks again
>> Greg
>>> Greg,
>>> The deviance being chi^2 distributed on the residual degrees of
freedom
>>> works in some cases (mostly where the response itself can be
reasonably
>>> approximated as Gaussian), but rather poorly in others (noteably low
count data). This is what you are seeing in your simulations - in the
first case the Poisson mean is relatively high - so the  normal
approximation to the Poisson is not too bad and the deviance is
approx.
>>> chi.sq on residual df. In the second case the Poisson mean is low, there
>>> are lots of zeroes, and the approximation breaks down. So yes, the
Pearson estimator is probably a better bet in the latter case. best,
>>> Simom
>>> ps. Note that the chi squared approximation for the *difference* in
deviance between two nested models does not suffer from this problem.
On 04/02/14 09:25, Greg Dropkin wrote:
>>>> mgcv: distribution of dev
>>>> hi
>>>> I can't tell if this is a simple error.
>>>> I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
>>>> I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
>>>> In particular, the low values in the 3rd and 4th versions would suggest
>>>> scale < 1, yet the data is Poisson generated.
>>>> The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no
longer
>>>> approx chi-sq on residual df?
>>>> Does it mean the scale estimate when fitting quasipoisson should be the
>>>> Pearson version?
>>>> thanks
>>>> greg
>>>> library(mgcv)
>>>> set.seed(1)
>>>> x1<-runif(1000)
>>>> linp<-2+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linp))
>>>> dev1<-dev2<-sums<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linp))
>>>> sums[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1[j]=m1$dev
>>>> dev2[j]=m2$dev
>>>> }
>>>> mean(sums)
>>>> sd(sums)
>>>> mean(dev1)
>>>> sd(dev1)
>>>> mean(dev2)
>>>> sd(dev2)
>>>> #dev slighly higher than expected
>>>> linpa<--1+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpa))
>>>> dev1a<-dev2a<-sumsa<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpa))
>>>> sumsa[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1a[j]=m1$dev
>>>> dev2a[j]=m2$dev
>>>> }
>>>> mean(sumsa)
>>>> sd(sumsa)
>>>> mean(dev1a)
>>>> sd(dev1a)
>>>> mean(dev2a)
>>>> sd(dev2a)
>>>> #dev a bit lower than expected
>>>> linpb<--1.5+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpb))
>>>> dev1b<-dev2b<-sumsb<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpb))
>>>> sumsb[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1b[j]=m1$dev
>>>> dev2b[j]=m2$dev
>>>> }
>>>> mean(sumsb)
>>>> sd(sumsb)
>>>> mean(dev1b)
>>>> sd(dev1b)
>>>> mean(dev2b)
>>>> sd(dev2b)
>>>> #dev much lower than expected
>>>> m1q<-gam(y~s(x1),family="quasipoisson",scale=-1)
>>>> m1q$scale
>>>> m1q$dev/(1000-sum(m1q$edf))
>>>> #scale estimate < 1
>>>> sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
>>>> #Pearson estimate of scale closer, but also < 1
>>>> linpc<--2+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpc))
>>>> dev1c<-dev2c<-sumsc<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpc))
>>>> sumsc[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1c[j]=m1$dev
>>>> dev2c[j]=m2$dev
>>>> }
>>>> mean(sumsc)
>>>> sd(sumsc)
>>>> mean(dev1c)
>>>> sd(dev1c)
>>>> mean(dev2c)
>>>> sd(dev2c)
>>>> #dev much lower than expected
>>>> m1q<-gam(y~s(x1),family="quasipoisson",scale=-1)
>>>> m1q$scale
>>>> m1q$sig2
>>>> m1q$dev/(1000-sum(m1q$edf))
>>>> #scale estimate < 1
>>>> sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
>>>> #Pearson estimate of scale ok
>>> --
>>> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44
(0)1225 386603               http://people.bath.ac.uk/sw283
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
>




More information about the R-help mailing list