[R] mgcv: distribution of dev with Poisson data

Wed Feb 5 16:39:24 CET 2014

hi Simon

yes, I also got the right shape of the mean-variance relation but the
wrong estimate of the parameter.

thanks very much

Greg

> Hi Greg,
>
> Yes, this sounds right - with quasipoisson gam uses `extended
> quasi-likelihood' (see McCullagh and Nelder's GLM book) to allow
> estimation of the scale parameter along with the smoothing parameters
> via (RE)ML, and it could well be that this gives a biased scale estimate
> with low counts (although the shape of the mean variance relationship is
> unaffected by this).
>
> It might well be more sensible to default to a Pearson estimate of the
> scale parameter for 'gam' ('bam' and 'gamm' already do this), to avoid
> this issue with quasipoisson and low counts. Your email reminded me that
> someone had reported rather too high p-values for quasipoisson with
> these sort of data, and looking back at my list of open issues I see
> that `someone' was you as well! It's possible that moving to a Pearson
> estimator of the scale will solve that problem too.
>
> Thanks for this... very helpful.
>
> best,
> Simon
>
>
> On 05/02/14 12:56, Greg Dropkin wrote:
>> thanks Simon
>>
>> also, it appears at least with ML that the default scale estimate with
>> quasipoisson (i.e. using dev) is the scale which minimises the ML value
>> of
>> the fitted model. So it is the "best" model but doesn't actually give
>> the
>> correct mean-variance relation. Is that right?
>>
>> thanks again
>>
>> Greg
>>
>>> Greg,
>>>
>>> The deviance being chi^2 distributed on the residual degrees of freedom
>>> works in some cases (mostly where the response itself can be reasonably
>>> approximated as Gaussian), but rather poorly in others (noteably low
>>> count data). This is what you are seeing in your simulations - in the
>>> first case the Poisson mean is relatively high - so the  normal
>>> approximation to the Poisson is not too bad and the deviance is approx.
>>> chi.sq on residual df. In the second case the Poisson mean is low,
>>> there
>>> are lots of zeroes, and the approximation breaks down. So yes, the
>>> Pearson estimator is probably a better bet in the latter case.
>>>
>>> best,
>>> Simom
>>>
>>> ps. Note that the chi squared approximation for the *difference* in
>>> deviance between two nested models does not suffer from this problem.
>>>
>>>
>>> On 04/02/14 09:25, Greg Dropkin wrote:
>>>> mgcv: distribution of dev
>>>>
>>>> hi
>>>>
>>>> I can't tell if this is a simple error.
>>>> I'm puzzled by the distribution of dev when fitting a gam to Poisson
>>>> generated data.
>>>> I expected dev to be approximately chi-squared on residual d.f., i.e.
>>>> about 1000 in each case below.
>>>> In particular, the low values in the 3rd and 4th versions would
>>>> suggest
>>>> scale < 1, yet the data is Poisson generated.
>>>> The problem isn't caused by insufficient k values in the smoother.
>>>> Does this mean that with sparse data the distribution of dev is no
>>>> longer
>>>> approx chi-sq on residual df?
>>>> Does it mean the scale estimate when fitting quasipoisson should be
>>>> the
>>>> Pearson version?
>>>>
>>>> thanks
>>>>
>>>> greg
>>>>
>>>> library(mgcv)
>>>> set.seed(1)
>>>> x1<-runif(1000)
>>>> linp<-2+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linp))
>>>> dev1<-dev2<-sums<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linp))
>>>> sums[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1[j]=m1$dev
>>>> dev2[j]=m2$dev
>>>> }
>>>> mean(sums)
>>>> sd(sums)
>>>> mean(dev1)
>>>> sd(dev1)
>>>> mean(dev2)
>>>> sd(dev2)
>>>>
>>>> #dev slighly higher than expected
>>>>
>>>> linpa<--1+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpa))
>>>> dev1a<-dev2a<-sumsa<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpa))
>>>> sumsa[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1a[j]=m1$dev
>>>> dev2a[j]=m2$dev
>>>> }
>>>> mean(sumsa)
>>>> sd(sumsa)
>>>> mean(dev1a)
>>>> sd(dev1a)
>>>> mean(dev2a)
>>>> sd(dev2a)
>>>>
>>>> #dev a bit lower than expected
>>>>
>>>> linpb<--1.5+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpb))
>>>> dev1b<-dev2b<-sumsb<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpb))
>>>> sumsb[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1b[j]=m1$dev
>>>> dev2b[j]=m2$dev
>>>> }
>>>> mean(sumsb)
>>>> sd(sumsb)
>>>> mean(dev1b)
>>>> sd(dev1b)
>>>> mean(dev2b)
>>>> sd(dev2b)
>>>>
>>>> #dev much lower than expected
>>>>
>>>> m1q<-gam(y~s(x1),family="quasipoisson",scale=-1)
>>>> m1q$scale
>>>> m1q$dev/(1000-sum(m1q$edf))
>>>>
>>>> #scale estimate < 1
>>>>
>>>> sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
>>>>
>>>> #Pearson estimate of scale closer, but also < 1
>>>>
>>>>
>>>> linpc<--2+exp(-2*x1)*sin(8*x1)
>>>> sum(exp(linpc))
>>>> dev1c<-dev2c<-sumsc<-rep(0,20)
>>>> for (j in 1:20)
>>>> {
>>>> y<-rpois(1000,exp(linpc))
>>>> sumsc[j]<-sum(y)
>>>> m1<-gam(y~s(x1),family="poisson")
>>>> m2<-gam(y~s(x1,k=20),family="poisson")
>>>> dev1c[j]=m1$dev
>>>> dev2c[j]=m2$dev
>>>> }
>>>> mean(sumsc)
>>>> sd(sumsc)
>>>> mean(dev1c)
>>>> sd(dev1c)
>>>> mean(dev2c)
>>>> sd(dev2c)
>>>>
>>>> #dev much lower than expected
>>>>
>>>> m1q<-gam(y~s(x1),family="quasipoisson",scale=-1)
>>>> m1q$scale
>>>> m1q$sig2
>>>> m1q$dev/(1000-sum(m1q$edf))
>>>>
>>>> #scale estimate < 1
>>>>
>>>> sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
>>>>
>>>> #Pearson estimate of scale ok
>>>>
>>>>
>>>
>>>
>>> --
>>> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
>>> +44 (0)1225 386603               http://people.bath.ac.uk/sw283
>>>
>>>
>>
>>
>
>
> --
> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> +44 (0)1225 386603               http://people.bath.ac.uk/sw283
>
>