[R] p-level in packages mgcv and gam
Henric Nilsson
henric.nilsson at statisticon.se
Sun Oct 2 17:47:25 CEST 2005
Denis Chabot said the following on 2005-09-29 21:55:
> OK, I think I understand better but still have two points to clarify.
>
> The first one is about the number of df. I think those who replied on
> this objected to the way I chose df, not the fact that I would run a
> model with 7.4 df per se. If I read you correctly, I artificially
> reduce my p-value by using the estimated df found in a mgcv gam model
> into another model where I fix df. This is fine, I am quite willing not
> to run a second model with a fixed df and instead tell my readers that
> my model is "marginally significant" with a p-value of 0.03.
>
> This being said, do you know of guidelines for choosing df? A colleague
> told me he does not go above 10% of the number of points. Should I be
> concerned when mgcv estimates 7.4 df for 34 points? Note that for this
> particular model P < 1e-16, and P is also very small if I fix df to
> either 4 or 7.
>
> My second point is the difference between models fitted by packages gam
> and mgcv. Sure, some of you have said "different algorithms". And when
Maybe that wasn't well put. Think of it as different implementations.
The packages `mgcv' and `gam' are by no means the only implementations
of GAM; see e.g. the `gss' package.
> I specify dfs, shouldn't P-values be very similar for the 2 packages?
> If not, what does it say of the confidence we can have in the models?
Since there's no universally accepted definition of what a GAM is, you
shouldn't expect the results to be the same.
> I draw your attention to this exampl: I obtained P-values of 0.50 and
> 0.03 with packages gam and mgcv respectively:
In this case, however, you're trying to compare apples to oranges...
> > library(gam)
> Loading required package: splines
> > data(kyphosis)
> > kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis)
> > summary.gam(kyp1)
>
> Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial, data =
> kyphosis)
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -1.3646 -0.6233 -0.4853 -0.3133 2.0965
>
> (Dispersion Parameter for binomial family taken to be 1)
>
> Null Deviance: 83.2345 on 80 degrees of freedom
> Residual Deviance: 71.9973 on 76.9999 degrees of freedom
> AIC: 79.9976
>
> Number of Local Scoring Iterations: 7
>
> DF for Terms and Chi-squares for Nonparametric Effects
>
> Df Npar Df Npar Chisq P(Chi)
> (Intercept) 1
> s(Number, 3) 1 2 1.37149 0.50375
This test concerns only the non-linear part of the term s(Number, 3). In
order to simultaneously test both the linear and non-linear part, as
mgcv::summary.gam does, you'd
> kyp1.1 <- gam(Kyphosis ~ 1, family=binomial, data=kyphosis)
> anova(kyp1.1, kyp1, test = "Chisq")
Analysis of Deviance Table
Model 1: Kyphosis ~ 1
Model 2: Kyphosis ~ s(Number, 3)
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 80.0000 83.234
2 76.9999 71.997 3.0001 11.237 0.011
HTH,
Henric
> > detach(package:gam)
> > library(mgcv)
> This is mgcv 1.3-7
> > kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
> data=kyphosis)
> > summary.gam(kyp2)
>
> Family: binomial
> Link function: logit
>
> Formula:
> Kyphosis ~ s(Number, k = 4, fx = T)
>
> Parametric coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -1.5504 0.3342 -4.64 3.49e-06 ***
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> Approximate significance of smooth terms:
> edf Est.rank Chi.sq p-value
> s(Number) 3 3 8.898 0.0307 *
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> R-sq.(adj) = 0.101 Deviance explained = 12.5%
> UBRE score = 0.075202 Scale est. = 1 n = 81
> > kyp2$deviance
> [1] 72.85893
> > kyp2$null.deviance
> [1] 83.23447
> > kyp2$df.null
> [1] 80
> > kyp2$df.residual
> [1] 77
>
> How can we explain this huge difference?
>
> Denis
>
>
>> Le 05-09-29 à 06:00, r-help-request at stat.math.ethz.ch a écrit :
>>
>> De : Henric Nilsson <henric.nilsson at statisticon.se>
>> Date : 29 septembre 2005 03:55:19 HAE
>> À : ym at climpact.com
>> Cc : r-help at stat.math.ethz.ch
>> Objet : Rép : [R] p-level in packages mgcv and gam
>> Répondre à : Henric Nilsson <henric.nilsson at statisticon.se>
>>
>>
>> Yves Magliulo said the following on 2005-09-28 17:05:
>>
>>> hi,
>>> i'll try to help you, i send a mail about this subject last week... and
>>> i did not have any response...
>>> I'm using gam from package mgcv. 1)
>>> How to interpret the significance of smooth terms is hard for me to
>>> understand perfectly : using UBRE, you fix df. p-value are estimated
>>> by chi-sq distribution using GCV, the best df are estimated by GAM.
>>> (that's what i want) and
>>> p-values
>>
>>
>>
>> This is not correct. The df are estimated in both cases (i.e. UBRE
>> and GCV), but the scale parameter is fixed in the UBRE case. Hence,
>> by default UBRE is used for family = binomial or poisson since the
>> scale parameter is assumed to be 1. Similarly, GCV is the default for
>> family = gaussian since we most often want the scale (usually denoted
>> sigma^2) to be estimated.
>>
>>
>>> are estimated by an F distribution But in that case they said "use at
>>> your own risk" in ?summary.gam
>>
>>
>>
>> The warning applies in both cases. The p-values are conditional on
>> the smoothing parameters, and the uncertainty of the smooths is not
>> taken into account when computing the p-values.
>>
>>
>>> so you can also look at the chi.sq : but i don't know how to choose a
>>
>>
>>
>> No...
>>
>>
>>> criterion like for p-values... for me, chi.sq show the best
>>> predictor in
>>> a model, but it's hard to reject one with it.
>>
>>
>>
>> Which version of mgcv do you use? The confusion probably stems from
>> earlier versions of mgcv (< 1.3-5): the summary and anova methods
>> used to have a column denoted Chi.sq even when the displayed
>> statistic was computed as F. Recent versions of mgcv has
>>
>> > summary(b)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) 7.9150 0.1049 75.44 <2e-16 ***
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>> edf Est.rank F p-value
>> s(x0) 5.173 9.000 3.785 0.000137 ***
>> s(x1) 2.357 9.000 34.631 < 2e-16 ***
>> s(x2) 8.517 9.000 84.694 < 2e-16 ***
>> s(x3) 1.000 1.000 0.444 0.505797
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) = 0.726 Deviance explained = 73.7%
>> GCV score = 4.611 Scale est. = 4.4029 n = 400
>>
>>
>> If we assume that the scale is known and fixed at 4.4029, we get
>>
>> > summary(b, dispersion = 4.4029)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 7.9150 0.1049 75.44 <2e-16 ***
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>> edf Est.rank Chi.sq p-value
>> s(x0) 5.173 9.000 34.067 8.7e-05 ***
>> s(x1) 2.357 9.000 311.679 < 2e-16 ***
>> s(x2) 8.517 9.000 762.255 < 2e-16 ***
>> s(x3) 1.000 1.000 0.444 0.505
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) = 0.726 Deviance explained = 73.7%
>> GCV score = 4.611 Scale est. = 4.4029 n = 400
>>
>> Note that t/F changed into z/Chi.sq.
>>
>>
>>> so as far as i m concerned, i use GCV methods, and fix a 5% on the null
>>> hypothesis (pvalue) to select significant predictor. after, i look
>>> at my
>>> smooth, and if the parametrization look fine to me, i validate.
>>> generaly, for p-values smaller than 0.001, you can be confident. over
>>> 0.001, you have to check. 2)
>>> for difference between package gam and mgcv, i sent a mail about this
>>
>>
>>
>> The underlying algorithms are very different.
>>
>>
>> HTH,
>> Henric
>>
>>
>> De : "Liaw, Andy" <andy_liaw at merck.com>
>> Date : 28 septembre 2005 14:01:25 HAE
>> À : 'Denis Chabot' <chabotd at globetrotter.net>, Peter Dalgaard
>> <p.dalgaard at biostat.ku.dk>
>> Cc : Thomas Lumley <tlumley at u.washington.edu>, R list <r-
>> help at stat.math.ethz.ch>
>> Objet : RE: [R] p-level in packages mgcv and gam
>>
>> Just change the df in what Thomas described to the degree of
>> polynomial, and
>> everything he said still applies. Any good book on regression that
>> covers
>> polynomial regression ought to point this out.
>>
>> Andy
>>
>>
>>> From: Denis Chabot
>>>
>>> But what about another analogy, that of polynomials? You may not be
>>> sure what degree polynomial to use, and you have not decided before
>>> analysing your data. You fit different polynomials to your data,
>>> checking if added degrees increase r2 sufficiently by doing F-tests.
>>>
>>> I thought it was the same thing with GAMs. You can fit a
>>> model with 4
>>> df, and in some cases it is of interest to see if this is a better
>>> fit than a linear fit. But why can't you also check if 7df is better
>>> than 4df? And if you used mgcv first and it tells you that 7df is
>>> better than 4df, why bother repeating the comparison 7df
>>> against 4df,
>>> why not just take the p-value for the model with 7df (fixed)?
>>>
>>> Denis
>>
>>
>
>
>> De : Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>> Date : 28 septembre 2005 12:04:58 HAE
>> À : Thomas Lumley <tlumley at u.washington.edu>
>> Cc : Denis Chabot <chabotd at globetrotter.net>, R list <r-
>> help at stat.math.ethz.ch>
>> Objet : Rép : [R] p-level in packages mgcv and gam
>>
>> Thomas Lumley <tlumley at u.washington.edu> writes:
>>
>>>
>>> Bob, on the other hand, chooses the amount of smoothing depending on
>>> the data. When a 4 df smooth fits best he ends up with the same model
>>> as Alice and the same p-value. When some other df fits best he ends
>>> up with a different model and a *smaller* p-value than Alice.
>>
>>
>> This doesn't actually follow, unless the p-value (directly or
>> indirectly) found its way into the definition of "best fit". It does
>> show the danger, though.
>>
>> --
>> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
>> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
>> (*) \(*) -- University of Copenhagen Denmark Ph: (+45)
>> 35327918
>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45)
>> 35327907
>>
>
>> De : Thomas Lumley <tlumley at u.washington.edu>
>> Date : 26 septembre 2005 12:54:43 HAE
>> À : Denis Chabot <chabotd at globetrotter.net>
>> Cc : r-help at stat.math.ethz.ch
>> Objet : Rép : [R] p-level in packages mgcv and gam
>>
>> On Mon, 26 Sep 2005, Denis Chabot wrote:
>>
>> But the mgcv manual warns that p-level for the smooth can be
>> underestimated when df are estimated by the model. Most of the time
>> my p-levels are so small that even doubling them would not result in
>> a value close to the P=0.05 threshold, but I have one case with P=0.033.
>>
>> I thought, probably naively, that running a second model with fixed
>> df, using the value of df found in the first model. I could not
>> achieve this with mgcv: its gam function does not seem to accept
>> fractional values of df (in my case 8.377).
>>
>> No, this won't work. The problem is the usual one with model
>> selection: the p-value is calculated as if the df had been fixed,
>> when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of
>> something that does adaptive smoothing.
>>
>> -thomas
>> De : Thomas Lumley <tlumley at u.washington.edu>
>> Date : 28 septembre 2005 11:33:27 HAE
>> À : Denis Chabot <chabotd at globetrotter.net>
>> Cc : R list <r-help at stat.math.ethz.ch>
>> Objet : Rép : [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> I only got one reply to my message:
>>
>> No, this won't work. The problem is the usual one with model
>> selection: the p-value is calculated as if the df had been fixed,
>> when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of
>> something that does adaptive smoothing.
>>
>> -thomas
>>
>> I do not understand this: it seems that a lot of people chose df=4
>> for no particular reason, but p-levels are correct. If instead I
>> choose df=8 because a previous model has estimated this to be an
>> optimal df, P-levels are no good because df are estimated?
>>
>> Yes. I know this sounds strange initially, but it really does make
>> sense if you think about it carefully.
>>
>> Suppose that Alice and Bob are kyphosis researchers, and that Alice
>> always chooses 4df for smoothing Age. We would all agree that her
>> p-values are correct [in fact we wouldn't, but that is a separate issue]
>>
>> Bob, on the other hand, chooses the amount of smoothing depending on
>> the data. When a 4 df smooth fits best he ends up with the same model
>> as Alice and the same p-value. When some other df fits best he ends
>> up with a different model and a *smaller* p-value than Alice.
>>
>> In particular, this is still true under the null hypothesis that Age
>> has no effect [If Alice and Bob are interested in p-values, the null
>> hypothesis must be plausible.]
>>
>> If Bob's p-values are always less than or equal to Alice's p-values
>> under the null hypothesis, and Alice's p-values are less than 0.05 5%
>> of the time, then Bob's p-values are less than 0.05 more than 5% of
>> the time.
>>
>>
>> -thomas
>>
>>
>> Furthermore, shouldn't packages gam and mgcv give similar results
>> when the same data and df are used? I tried this:
>>
>> library(gam)
>> data(kyphosis)
>> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
>> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
>> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
>> anova.gam(kyp1)
>> anova.gam(kyp2)
>> anova.gam(kyp3)
>>
>> detach(package:gam)
>> library(mgcv)
>> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial,
>> data=kyphosis)
>> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
>> data=kyphosis)
>> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial,
>> data=kyphosis)
>> anova.gam(kyp4)
>> anova.gam(kyp5)
>> anova.gam(kyp6)
>>
>>
>> P levels for these models, by pair
>>
>> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
>> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
>> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)
>>
>> Also if you plot all these you find that the mgcv plots are smoother
>> than the gam plots, even the same df are used all the time.
>>
>> I am really confused now!
>>
>> Denis
>>
>
>>
>> Thomas Lumley Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu University of Washington, SeattleDe :
>> Thomas Lumley <tlumley at u.washington.edu>
>> Date : 28 septembre 2005 14:35:26 HAE
>> À : Denis Chabot <chabotd at globetrotter.net>
>> Cc : Peter Dalgaard <p.dalgaard at biostat.ku.dk>, R list <r-
>> help at stat.math.ethz.ch>
>> Objet : Rép : [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> But what about another analogy, that of polynomials? You may not be
>> sure what degree polynomial to use, and you have not decided before
>> analysing your data. You fit different polynomials to your data,
>> checking if added degrees increase r2 sufficiently by doing F-tests.
>>
>> Yes, you can. And this procedure gives you incorrect p-values.
>>
>> They may not be very incorrect -- it depends on how much model
>> selection you do, and how strongly the feature you are selecting on
>> is related to the one you are testing.
>>
>> For example, using step() to choose a polynomial in x even when x is
>> unrelated to y and z inflates the Type I error rate by giving a
>> biased estimate of the residual mean squared error:
>>
>> once<-function(){
>> y<-rnorm(50);x<-runif(50);z<-rep(0:1,25)
>> summary(step(lm(y~z),
>> scope=list(lower=~z,upper=~z+x+I(x^2)+I(x^3)+I(x^4)),
>> trace=0))$coef["z",4]
>> }
>> p<-replicate(1000,once())
>> mean(p<0.05)
>> [1] 0.072
>>
>> which is significantly higher than you would expect for an honest
>> level 0.05 test.
>>
>> -thomas
>
>
More information about the R-help
mailing list