[R] p-level in packages mgcv and gam

Henric Nilsson henric.nilsson at statisticon.se
Sun Oct 2 17:47:25 CEST 2005


Denis Chabot said the following on 2005-09-29 21:55:

> OK, I think I understand better but still have two points to clarify.
> 
> The first one is about the number of df. I think those who replied on  
> this objected to the way I chose df, not the fact that I would run a  
> model with 7.4 df per se. If I read you correctly, I artificially
> reduce my p-value by taking the estimated df found in an mgcv gam model
> and fixing the df at that value in a second model. This is fine; I am
> quite willing not to run a second model with fixed df, and instead tell
> my readers that my model is "marginally significant" with a p-value of 0.03.
> 
> This being said, do you know of guidelines for choosing df? A colleague
> told me he does not go above 10% of the number of points. Should I be
> concerned when mgcv estimates 7.4 df for 34 points? Note that for this
> particular model P < 1e-16, and P is also very small if I fix df to
> either 4 or 7.
> 
> My second point is the difference between models fitted by packages gam
> and mgcv. Sure, some of you have said "different algorithms". And when

Maybe that wasn't well put. Think of it as different implementations. 
The packages `mgcv' and `gam' are by no means the only implementations 
of GAM; see e.g. the `gss' package.
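
Note, in passing, that both packages provide a function called `gam',
so only one of them should be loaded at a time. A minimal sketch:

## both packages define gam(); detach one before loading the other
library(gam)        ## Hastie and Tibshirani's backfitting implementation
## ... fit and inspect gam-package models here ...
detach(package:gam)
library(mgcv)       ## Wood's penalized regression spline implementation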

> I specify dfs, shouldn't P-values be very similar for the two packages?
> If not, what does that say about the confidence we can have in the models?

Since there's no universally accepted definition of what a GAM is, you 
shouldn't expect the results to be the same.

> I draw your attention to this example: I obtained P-values of 0.50 and
> 0.03 with packages gam and mgcv, respectively:

In this case, however, you're trying to compare apples to oranges...

>  > library(gam)
> Loading required package: splines
>  > data(kyphosis)
>  > kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis)
>  > summary.gam(kyp1)
> 
> Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial, data =
> kyphosis)
> Deviance Residuals:
>     Min      1Q  Median      3Q     Max
> -1.3646 -0.6233 -0.4853 -0.3133  2.0965
> 
> (Dispersion Parameter for binomial family taken to be 1)
> 
>     Null Deviance: 83.2345 on 80 degrees of freedom
> Residual Deviance: 71.9973 on 76.9999 degrees of freedom
> AIC: 79.9976
> 
> Number of Local Scoring Iterations: 7
> 
> DF for Terms and Chi-squares for Nonparametric Effects
> 
>              Df Npar Df Npar Chisq  P(Chi)
> (Intercept)   1
> s(Number, 3)  1       2    1.37149 0.50375

This test concerns only the non-linear part of the term s(Number, 3). In
order to test the linear and non-linear parts simultaneously, as
mgcv::summary.gam does, you'd do something like this:

 > kyp1.1 <- gam(Kyphosis ~ 1, family=binomial, data=kyphosis)
 > anova(kyp1.1, kyp1, test = "Chisq")
Analysis of Deviance Table

Model 1: Kyphosis ~ 1
Model 2: Kyphosis ~ s(Number, 3)
   Resid. Df Resid. Dev      Df Deviance P(>|Chi|)
1   80.0000     83.234
2   76.9999     71.997  3.0001   11.237     0.011
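
For comparison, here's a sketch of the analogous joint test on the mgcv
side (assuming the kyphosis data frame is still in the workspace, and
reusing the kyp2 fit from your mgcv session below; being deviance-based,
it won't exactly match the Wald-type statistic in mgcv's summary, but it
tests the same joint hypothesis):

 > library(mgcv)
 > kyp2.0 <- gam(Kyphosis ~ 1, family = binomial, data = kyphosis)
 > anova(kyp2.0, kyp2, test = "Chisq")

Given the deviances shown below, the drop is 83.234 - 72.859 = 10.376 on
3 df, i.e. p is about 0.016.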


HTH,
Henric

>  > detach(package:gam)
>  > library(mgcv)
> This is mgcv 1.3-7
>  > kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial, data=kyphosis)
>  > summary.gam(kyp2)
> 
> Family: binomial
> Link function: logit
> 
> Formula:
> Kyphosis ~ s(Number, k = 4, fx = T)
> 
> Parametric coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  -1.5504     0.3342   -4.64 3.49e-06 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Approximate significance of smooth terms:
>           edf Est.rank Chi.sq p-value
> s(Number)   3        3  8.898  0.0307 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> R-sq.(adj) =  0.101   Deviance explained = 12.5%
> UBRE score = 0.075202  Scale est. = 1         n = 81
>  > kyp2$deviance
> [1] 72.85893
>  > kyp2$null.deviance
> [1] 83.23447
>  > kyp2$df.null
> [1] 80
>  > kyp2$df.residual
> [1] 77
> 
> How can we explain this huge difference?
> 
> Denis
> 
> 
>> On 2005-09-29 06:00, r-help-request at stat.math.ethz.ch wrote:
>>
>> From: Henric Nilsson <henric.nilsson at statisticon.se>
>> Date: September 29, 2005 03:55:19 EDT
>> To: ym at climpact.com
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] p-level in packages mgcv and gam
>> Reply-To: Henric Nilsson <henric.nilsson at statisticon.se>
>>
>>
>> Yves Magliulo said the following on 2005-09-28 17:05:
>>
>>> Hi,
>>> I'll try to help you; I sent a mail about this subject last week and
>>> did not get any response... I'm using gam from package mgcv.
>>> 1) How to interpret the significance of smooth terms is hard for me
>>> to understand perfectly: using UBRE, you fix the df and p-values are
>>> estimated by a chi-sq distribution; using GCV, the best df are
>>> estimated by the GAM (that's what I want) and p-values
>>
>> This is not correct. The df are estimated in both cases (i.e. UBRE
>> and GCV), but the scale parameter is fixed in the UBRE case. Hence,
>> by default UBRE is used for family = binomial or poisson, since the
>> scale parameter is assumed to be 1. Similarly, GCV is the default for
>> family = gaussian, since we most often want the scale (usually denoted
>> sigma^2) to be estimated.
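>>
>> For instance, a minimal sketch (my assumptions: the fitted gam
>> object's scale.estimated component records this, gamSim() ships with
>> mgcv, and kyphosis is loaded as elsewhere in this thread):
>>
>> library(mgcv)
>> b1 <- gam(y ~ s(x0) + s(x1), data = gamSim(1, n = 200))  ## gaussian
>> b1$scale.estimated  ## TRUE: scale estimated, so GCV is minimized
>> b2 <- gam(Kyphosis ~ s(Number), family = binomial, data = kyphosis)
>> b2$scale.estimated  ## FALSE: scale fixed at 1, so UBRE is used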
>>
>>
>>> are estimated by an F distribution. But in that case they say "use
>>> at your own risk" in ?summary.gam.
>>
>> The warning applies in both cases. The p-values are conditional on
>> the smoothing parameters, and the uncertainty of the smooths is not
>> taken into account when computing the p-values.
>>
>>
>>> You can also look at the chi.sq, but I don't know how to choose a
>>
>> No...
>>
>>
>>> criterion like for p-values... For me, chi.sq shows the best
>>> predictor in a model, but it's hard to reject one with it.
>>
>> Which version of mgcv do you use? The confusion probably stems from
>> earlier versions of mgcv (< 1.3-5): the summary and anova methods
>> used to have a column denoted Chi.sq even when the displayed
>> statistic was computed as F. Recent versions of mgcv have
>>
>> > summary(b)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>>         edf Est.rank      F  p-value
>> s(x0) 5.173    9.000  3.785 0.000137 ***
>> s(x1) 2.357    9.000 34.631  < 2e-16 ***
>> s(x2) 8.517    9.000 84.694  < 2e-16 ***
>> s(x3) 1.000    1.000  0.444 0.505797
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) =  0.726   Deviance explained = 73.7%
>> GCV score =  4.611   Scale est. = 4.4029    n = 400
>>
>>
>> If we assume that the scale is known and fixed at 4.4029, we get
>>
>> > summary(b, dispersion = 4.4029)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>>             Estimate Std. Error z value Pr(>|z|)
>> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>>         edf Est.rank  Chi.sq p-value
>> s(x0) 5.173    9.000  34.067 8.7e-05 ***
>> s(x1) 2.357    9.000 311.679 < 2e-16 ***
>> s(x2) 8.517    9.000 762.255 < 2e-16 ***
>> s(x3) 1.000    1.000   0.444   0.505
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) =  0.726   Deviance explained = 73.7%
>> GCV score =  4.611   Scale est. = 4.4029    n = 400
>>
>> Note that t/F changed into z/Chi.sq. (With the scale fixed, the
>> Chi.sq statistic is just the F statistic times the rank of the test:
>> e.g. 3.785 * 9 = 34.07 for s(x0) above.)
>>
>>
>>> So as far as I'm concerned, I use the GCV method and apply a 5%
>>> threshold on the null hypothesis (p-value) to select significant
>>> predictors. Afterwards, I look at my smooths, and if the
>>> parametrization looks fine to me, I validate. Generally, for p-values
>>> smaller than 0.001 you can be confident; above 0.001, you have to check.
>>> 2) For the difference between packages gam and mgcv, I sent a mail
>>> about this
>>
>> The underlying algorithms are very different.
>>
>>
>> HTH,
>> Henric
>>
>>
>> De : "Liaw, Andy" <andy_liaw at merck.com>
>> Date : 28 septembre 2005 14:01:25 HAE
>> À : 'Denis Chabot' <chabotd at globetrotter.net>, Peter Dalgaard  
>> <p.dalgaard at biostat.ku.dk>
>> Cc : Thomas Lumley <tlumley at u.washington.edu>, R list <r- 
>> help at stat.math.ethz.ch>
>> Objet : RE: [R] p-level in packages mgcv and gam
>>
>> Just change the df in what Thomas described to the degree of the
>> polynomial, and everything he said still applies. Any good book on
>> regression that covers polynomial regression ought to point this out.
>>
>> Andy
>>
>>
>>> From: Denis Chabot
>>>
>>> But what about another analogy, that of polynomials? You may not be
>>> sure what degree polynomial to use, and you have not decided before
>>> analysing your data. You fit different polynomials to your data,
>>> checking if added degrees increase r2 sufficiently by doing F-tests.
>>>
>>> I thought it was the same thing with GAMs. You can fit a
>>> model with 4
>>> df, and in some cases it is of interest to see if this is a better
>>> fit than a linear fit. But why can't you also check if 7df is better
>>> than 4df? And if you used mgcv first and it tells you that 7df is
>>> better than 4df, why bother repeating the comparison 7df
>>> against 4df,
>>> why not just take the p-value for the model with 7df (fixed)?
>>>
>>> Denis
>>
>>
> 
> 
>> From: Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>> Date: September 28, 2005 12:04:58 EDT
>> To: Thomas Lumley <tlumley at u.washington.edu>
>> Cc: Denis Chabot <chabotd at globetrotter.net>, R list
>> <r-help at stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> Thomas Lumley <tlumley at u.washington.edu> writes:
>>
>>>
>>> Bob, on the other hand, chooses the amount of smoothing depending on
>>> the data. When a 4 df smooth fits best he ends up with the same model
>>> as Alice and the same p-value.  When some other df fits best he ends
>>> up with a different model and a *smaller* p-value than Alice.
>>
>>
>> This doesn't actually follow, unless the p-value (directly or
>> indirectly) found its way into the definition of "best fit". It does
>> show the danger, though.
>>
>> -- 
>>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>  (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
>>
> 
>> From: Thomas Lumley <tlumley at u.washington.edu>
>> Date: September 26, 2005 12:54:43 EDT
>> To: Denis Chabot <chabotd at globetrotter.net>
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Mon, 26 Sep 2005, Denis Chabot wrote:
>>
>> But the mgcv manual warns that p-level for the smooth can be
>> underestimated when df are estimated by the model. Most of the time
>> my p-levels are so small that even doubling them would not result in
>> a value close to the P=0.05 threshold, but I have one case with P=0.033.
>>
>> I thought, probably naively, that I could run a second model with
>> fixed df, using the value of df found in the first model. I could not
>> achieve this with mgcv: its gam function does not seem to accept
>> fractional values of df (in my case 8.377).
>>
>> No, this won't work.  The problem is the usual one with model  
>> selection: the p-value is calculated as if the df had been fixed,  
>> when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of  
>> something that does adaptive smoothing.
>>
>>     -thomas
>> From: Thomas Lumley <tlumley at u.washington.edu>
>> Date: September 28, 2005 11:33:27 EDT
>> To: Denis Chabot <chabotd at globetrotter.net>
>> Cc: R list <r-help at stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> I only got one reply to my message:
>>
>> No, this won't work.  The problem is the usual one with model
>> selection: the p-value is calculated as if the df had been fixed,
>> when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of
>> something that does adaptive smoothing.
>>
>>     -thomas
>>
>> I do not understand this: it seems that a lot of people chose df=4
>> for no particular reason, but p-levels are correct. If instead I
>> choose df=8 because a previous model has estimated this to be an
>> optimal df, P-levels are no good because df are estimated?
>>
>> Yes. I know this sounds strange initially, but it really does make
>> sense if you think about it carefully.
>>
>> Suppose that Alice and Bob are kyphosis researchers, and that Alice
>> always chooses 4 df for smoothing Age. We would all agree that her
>> p-values are correct [in fact we wouldn't, but that is a separate issue].
>>
>> Bob, on the other hand, chooses the amount of smoothing depending on
>> the data. When a 4 df smooth fits best he ends up with the same model
>> as Alice and the same p-value. When some other df fits best he ends
>> up with a different model and a *smaller* p-value than Alice.
>>
>> In particular, this is still true under the null hypothesis that Age
>> has no effect. [If Alice and Bob are interested in p-values, the null
>> hypothesis must be plausible.]
>>
>> If Bob's p-values are always less than or equal to Alice's p-values
>> under the null hypothesis, and Alice's p-values are less than 0.05 5%
>> of the time, then Bob's p-values are less than 0.05 more than 5% of
>> the time.
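>>
>> A rough simulation sketch of this argument (my assumptions: mgcv is
>> loaded, and summary.gam returns the smooth-term p-values in its s.pv
>> component):
>>
>> once <- function() {
>>   x <- runif(50); y <- rnorm(50)  ## the null holds: x is unrelated to y
>>   p.alice <- summary(gam(y ~ s(x, k = 5, fx = TRUE)))$s.pv  ## fixed 4 df
>>   p.bob   <- summary(gam(y ~ s(x)))$s.pv  ## df chosen adaptively by GCV
>>   c(alice = p.alice, bob = p.bob)
>> }
>> p <- replicate(500, once())
>> rowMeans(p < 0.05)  ## Bob's rate should exceed Alice's nominal 5%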
>>
>>
>>     -thomas
>>
>>
>> Furthermore, shouldn't packages gam and mgcv give similar results
>> when the same data and df are used? I tried this:
>>
>> library(gam)
>> data(kyphosis)
>> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
>> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
>> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
>> anova.gam(kyp1)
>> anova.gam(kyp2)
>> anova.gam(kyp3)
>>
>> detach(package:gam)
>> library(mgcv)
>> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial, data=kyphosis)
>> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial, data=kyphosis)
>> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial, data=kyphosis)
>> anova.gam(kyp4)
>> anova.gam(kyp5)
>> anova.gam(kyp6)
>>
>>
>> P levels for these models, by pair
>>
>> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
>> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
>> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)
>>
>> Also, if you plot all these you find that the mgcv plots are smoother
>> than the gam plots, even though the same df are used in every case.
>>
>> I am really confused now!
>>
>> Denis
>>
> 
>>
>> Thomas Lumley            Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu    University of Washington, Seattle
>>
>> From: Thomas Lumley <tlumley at u.washington.edu>
>> Date: September 28, 2005 14:35:26 EDT
>> To: Denis Chabot <chabotd at globetrotter.net>
>> Cc: Peter Dalgaard <p.dalgaard at biostat.ku.dk>, R list
>> <r-help at stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> But what about another analogy, that of polynomials? You may not be  
>> sure what degree polynomial to use, and you have not decided before  
>> analysing your data. You fit different polynomials to your data,  
>> checking if added degrees increase r2 sufficiently by doing F-tests.
>>
>> Yes, you can. And this procedure gives you incorrect p-values.
>>
>> They may not be very incorrect -- it depends on how much model
>> selection you do, and how strongly the feature you are selecting on
>> is related to the one you are testing.
>>
>> For example, using step() to choose a polynomial in x even when x is
>> unrelated to y and z inflates the Type I error rate by giving a
>> biased estimate of the residual mean squared error:
>>
>> once <- function(){
>>   ## y, x and z are mutually unrelated; x enters only as candidate terms
>>   y <- rnorm(50); x <- runif(50); z <- rep(0:1, 25)
>>   ## let step() pick the polynomial in x, then extract the p-value for z
>>   summary(step(lm(y ~ z),
>>         scope = list(lower = ~z, upper = ~z + x + I(x^2) + I(x^3) + I(x^4)),
>>         trace = 0))$coef["z", 4]
>> }
>> p<-replicate(1000,once())
>> mean(p<0.05)
>> [1] 0.072
>>
>> which is significantly higher than you would expect for an honest
>> level 0.05 test: with 1000 replicates, the standard error of the
>> estimated rate is about sqrt(0.05*0.95/1000) = 0.007, so 0.072 is
>> roughly three standard errors above 0.05.
>>
>>     -thomas