[R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?
Martijn Wieling
wieling at gmail.com
Mon Jun 11 10:47:45 CEST 2012
Hi Simon,
Thanks for your reply. That is very helpful.
However, in the logistic regression example, there also appears to be
a large difference in the p-values and estimates when comparing
summary(model) # mgcv 1.7-11
summary(model,freq=F) # mgcv 1.7-17
These should be the same, right? But why is this not the case? The
results are shown below, and are also explained in this post:
http://r.789695.n4.nabble.com/mgcv-bam-very-large-standard-error-difference-between-versions-1-7-11-and-1-7-17-bug-tp4632173p4632491.html)
With kind regards,
Martijn
# call in all versions
model = bam(NoStandard3 ~
te(GeoX,GeoY,WordFreqLog.z,by=OldSpeakerContrast,d=c(2,1)) +
OldSpeakerContrast + PopCntLog.z + s(Word,bs="re") +
s(Placename,bs="re") + s(Word,PopAvgIncomeLog.z,bs="re") +
s(Word,PopAvgAgeLog_residIncome.z,bs="re") +
s(Word,PopCntLog.z,bs="re") +
s(Word,YearRec.z,bs="re"),data=lexdst,family="binomial",method="REML")
### mgcv, version 1.7-11
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
# d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
# bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
# bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
# s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.41610 0.14027 -2.966 0.00301 **
#OldSpeakerContrast1 0.44508 0.01934 23.009 < 2e-16 ***
#PopCntLog.z -0.11673 0.02725 -4.284 1.83e-05 ***
#---
#Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
# edf Ref.df Chi.sq p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0 54.05 69.06 223.3 < 2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1 59.17 74.83 327.6 < 2e-16
#s(Word) 166.41 167.84 13756.0 < 2e-16
#s(Placename) 157.64 188.02 692.7 < 2e-16
#s(Word,PopAvgIncomeLog.z) 137.83 160.05 829.9 < 2e-16
#s(Word,PopAvgAgeLog_residIncome.z) 121.13 151.13 435.5 < 2e-16
#s(Word,PopCntLog.z) 97.12 133.97 205.5 7.01e-05
#s(Word,YearRec.z) 128.98 155.27 585.7 < 2e-16
#
#R-sq.(adj) = 0.372 Deviance explained = 32.4%
#REML score = 97450 Scale est. = 1 n = 69259
### mgcv 1.7-17, summary(model,freq=F)
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
# d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
# bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
# bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
# s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.98150 0.52485 -1.870 0.0615 .
#OldSpeakerContrast1 0.65497 0.68628 0.954 0.3399
#PopCntLog.z -0.11724 0.02726 -4.300 1.71e-05 ***
#---
#Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
# edf Ref.df Chi.sq p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0 53.62 68.48 221.4 <2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1 58.64 74.14 324.9 <2e-16
#s(Word) 166.41 168.00 14351.7 <2e-16
#s(Placename) 157.79 209.00 1160.0 <2e-16
#s(Word,PopAvgIncomeLog.z) 137.84 170.00 1062.1 <2e-16
#s(Word,PopAvgAgeLog_residIncome.z) 121.15 170.00 686.7 <2e-16
#s(Word,PopCntLog.z) 97.10 169.00 434.0 <2e-16
#s(Word,YearRec.z) 128.98 170.00 867.4 <2e-16
#
#R-sq.(adj) = 0.372 Deviance explained = 32.4%
#REML score = 97449 Scale est. = 1 n = 69259
Met vriendelijke groet,
Martijn
--
*******************************************
Martijn Wieling
http://www.martijnwieling.nl
wieling at gmail.com
+31(0)614108622
*******************************************
University of Groningen
http://www.rug.nl/staff/m.b.wieling
*******************************************
On Mon, Jun 11, 2012 at 10:34 AM, Simon Wood <s.wood at bath.ac.uk> wrote:
> Dear Martijn,
>
> You are right that the change in the summary defaults was not a sensible
> thing to do, and I'll change it back. However in the mean time you can use
> 1.7-17 with summary(...,freq=FALSE), when you have random effects in the
> model (it really shouldn't make a statistically meaningful difference
> otherwise).
>
> I made the change because it leads to better p-value performance when
> parametric model component are penalized using gam's `paraPen' argument, but
> as you have demonstrated this then uses a fixed effects distribution that is
> not really appropriate when there are random effects present. I should
> probably just deal with the paraPen case in the paraPen documentation.
>
> Technically the difference here is that with freq=TRUE the covariance matrix
> for the model coefficient estimators/predictions is taken to be
>
> (X'X+S)^{-1}X'X(X'X+S)^{-1}\sigma^2
>
> where X is the model matrix and S the penalty matrix for the model (Gaussian
> case, non-Gaussian also has weights). This is based on the fact that the
> coefficient estimates/predictions are
> \hat \beta = (X'X+S)^{-1}X'y
> where y is the response data, and y_i iid N(0,\sigma^2).
>
> When freq=FALSE the covariance matrix is the one that comes from the
> Bayesian/random effects view of the smoothing process, and is
>
> (X'X+S)^{-1}\sigma^2
>
> The latter is usually what you want when you have true random effects in the
> model (rather than smooths which are usually not in the model as true random
> effects).
>
> best,
> Simon
>
>
>
> On 03/06/12 18:45, Martijn Wieling wrote:
>>
>> Dear useRs,
>>
>> I've ran some additional analyses (see below), which strongly suggest
>> the standard errors of the bam (and gam) function are much too low in
>> mgcv version 1.7-17, at least when including an s(X,bs="re") term.
>> Until this issue has been clarified, it's perhaps best to use an older
>> version of mgcv (unfortunately, however, in earlier versions the
>> p-value calculation of s(X,bs="re") is not correct). All analyses were
>> conducted in R 2.15.0.
>>
>> My approach was the following: I created a mixed-effects regression
>> model with a single random intercept and only linear predictors. In my
>> view, the results using lmer (lme4) should be comparable to those of
>> bam and gam (mgcv). This was the case when using an older version of
>> mgcv (version 1.7-13), but this is not the case anymore in version
>> 1.7-17. In version 1.7-17, the standard errors and p-values are much
>> lower and very similar to those of a linear model (which does not take
>> the random-effects structure into account). The R-code and results are
>> shown below. (The results using gam are not shown, but show the same
>> pattern.)
>>
>> Furthermore, note that the differences in standard errors become less
>> severe (but still noticeable) when less data is involved (e.g., using
>> only 500 rows as opposed to>100.000). Finally, when not including an
>> s(X,bs="re") term, but another non-random-effect smooth, the standard
>> errors do not appear to be structurally lower (only for some
>> variables, but not by a great deal - see also below).
>>
>> With kind regards,
>> Martijn Wieling
>> University of Groningen
>>
>> #### lme4 model (most recent version of lme4)
>> modelLMER<- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + (1|Key), data=wrddst)
>> # Estimate Std. Error t value
>> #SpYearBirth.z -0.012084 0.004577 -2.640
>> #IsAragon 0.138959 0.010040 13.840
>> #SpIsMale -0.003087 0.008290 -0.372
>> #SpYearBirth.z:IsAragon 0.015429 0.010159 1.519
>>
>>
>> #### mgcv 1.7-13, default (method = "REML") - almost identical to
>> modelLMER
>> modelBAMold<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst)
>> # Estimate Std. Error t value Pr(>|t|)
>> #SpYearBirth.z -0.012084 0.004578 -2.640 0.00829 **
>> #IsAragon 0.138959 0.010042 13.838< 2e-16 ***
>> #SpIsMale -0.003087 0.008292 -0.372 0.70968
>> #SpYearBirth.z:IsAragon 0.015429 0.010160 1.519 0.12886
>>
>>
>> #### mgcv 1.7-17, method = "REML" - standard errors greatly reduced
>> # (comparable to standard errors of LM without random intercept)
>> modelBAMnew<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
>> # Estimate Std. Error t value Pr(>|t|)
>> #SpYearBirth.z -0.012084 0.001159 -10.428< 2e-16 ***
>> #IsAragon 0.138959 0.002551 54.472< 2e-16 ***
>> #SpIsMale -0.003087 0.002098 -1.471 0.141
>> #SpYearBirth.z:IsAragon 0.015429 0.002587 5.965 2.45e-09 ***
>>
>> #### lm results, standard errors comparable to mgcv 1.7-17
>> modelLM<- lm(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon + SpIsMale,
>> data=wrddst)
>> # Estimate Std. Error t value Pr(>|t|)
>> #(Intercept) -0.025779 0.001653 -15.595< 2e-16 ***
>> #SpYearBirth.z -0.011906 0.001182 -10.070< 2e-16 ***
>> #IsAragon 0.139323 0.002603 53.531< 2e-16 ***
>> #SpIsMale -0.003076 0.002140 -1.437 0.151
>> #SpYearBirth.z:IsAragon 0.015252 0.002639 5.780 7.49e-09 ***
>>
>>
>> #### mgcv 1.7-17, default (method = "fREML") - completely different
>> from previous models
>> modelBAMfREML<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
>> # Estimate Std. Error t value Pr(>|t|)
>> #(Intercept) -0.025391 0.106897 -0.238 0.812
>> #SpYearBirth.z -0.012084 0.076300 -0.158 0.874
>> #IsAragon 0.138959 0.166697 0.834 0.405
>> #SpIsMale -0.003087 0.138291 -0.022 0.982
>> #SpYearBirth.z:IsAragon 0.015429 0.168260 0.092 0.927
>> #
>> #Approximate significance of smooth terms:
>> # edf Ref.df F p-value
>> #s(Key) -38.95 310 15.67<2e-16 ***
>>
>>
>> #### differences w.r.t. standard smooths
>> #### mgcv version 1.7-13
>> m2old<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
>> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
>> ## RESULTS
>> #Family: gaussian
>> #Link function: identity
>> #
>> #Formula:
>> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
>> # SpIsMale
>> #
>> #Parametric coefficients:
>> # Estimate Std. Error t value Pr(>|t|)
>> #(Intercept) -0.001386 0.004982 -0.278 0.7809
>> #SpYearBirth.z -0.012950 0.001167 -11.097< 2e-16 ***
>> #IsAragon 0.020532 0.023608 0.870 0.3845
>> #SpIsMale -0.004788 0.002219 -2.158 0.0309 *
>> #SpYearBirth.z:IsAragon 0.015611 0.002600 6.005 1.92e-09 ***
>> #---
>> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #Approximate significance of smooth terms:
>> # edf Ref.df F p-value
>> #s(GeoX,GeoY) 27.11 28.14 126.2<2e-16 ***
>> #---
>> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #R-sq.(adj) = 0.0555 Deviance explained = 5.58%
>> #REML score = 39232 Scale est. = 0.11734 n = 112608
>>
>>
>> #### mgcv version 1.7-17
>> m2new<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
>> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
>> #Family: gaussian
>> #Link function: identity
>> #
>> #Formula:
>> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
>> # SpIsMale
>> #
>> #Parametric coefficients:
>> # Estimate Std. Error t value Pr(>|t|)
>> #(Intercept) -0.001388 0.003938 -0.352 0.7245
>> #SpYearBirth.z -0.012950 0.001167 -11.098< 2e-16 ***
>> #IsAragon 0.020543 0.018055 1.138 0.2552
>> #SpIsMale -0.004788 0.002215 -2.161 0.0307 *
>> #SpYearBirth.z:IsAragon 0.015611 0.002600 6.005 1.92e-09 ***
>> #---
>> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #Approximate significance of smooth terms:
>> # edf Ref.df F p-value
>> #s(GeoX,GeoY) 27.11 28.14 126.2<2e-16 ***
>> #---
>> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #R-sq.(adj) = 0.0555 Deviance explained = 5.58%
>> #REML score = 39232 Scale est. = 0.11734 n = 112608
>>
>>
>> On Sat, Jun 2, 2012 at 6:25 PM, Martijn Wieling<wieling at gmail.com> wrote:
>>>
>>> Dear useRs,
>>>
>>> I reran an analysis with bam (mgcv, version 1.7-17) originally
>>> conducted using an older version of bam (mgcv, version 1.7-11) and
>>> this resulted in the same estimates, but much lower standard errors
>>> (in some cases 20 times as low) and lower p-values. This obviously
>>> results in a larger set of significant predictors. Is this result
>>> expected given the improvements in the new version? Or this a bug and
>>> are the p-values of bam in mgcv 1.7-17 too low? The summaries of both
>>> versions are shown below to enable a comparison.
>>>
>>> In addition, applying the default method="fREML" (mgcv version 1.7-17)
>>> on the same dataset yields only non-significant results, while all
>>> results are highly significant using method="REML". Furthermore, it
>>> also results in large negative (e.g., -8757) edf values linked to
>>> s(X,bs="RE") terms. Is this correct, or is this a bug? The summary of
>>> the model using method="fREML" is also shown below.
>>>
>>> I hope someone can shed some light on this.
>>>
>>> With kind regards,
>>> Martijn Wieling,
>>> University of Groningen
>>>
>>> #################################
>>> ### mgcv version 1.7-11
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>> s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) -0.099757 0.020234 -4.930 8.23e-07 ***
>>> RefVratio.z 0.105705 0.013328 7.931 2.19e-15 ***
>>> IsSemiwordOrDemonstrative 0.289828 0.046413 6.245 4.27e-10 ***
>>> RefSoundCnt.z 0.119981 0.021202 5.659 1.53e-08 ***
>>> SpYearBirth.z -0.011396 0.002407 -4.734 2.21e-06 ***
>>> IsAragon 0.055678 0.033137 1.680 0.09291 .
>>> PopCntLog_residGeo.z -0.006504 0.003279 -1.984 0.04731 *
>>> SpYearBirth.z:IsAragon 0.015871 0.005459 2.907 0.00365 **
>>> ---
>>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> Approximate significance of smooth terms:
>>> edf Ref.df F p-value
>>> s(GeoX,GeoY) 24.01 24.21 31.16<2e-16 ***
>>> s(Word) 352.29 347.00 501.57<2e-16 ***
>>> s(Key) 269.75 289.25 10.76<2e-16 ***
>>> ---
>>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) = 0.693 Deviance explained = 69.4%
>>> REML score = -22476 Scale est. = 0.038177 n = 112608
>>>
>>>
>>> #################################
>>> ### mgcv version 1.7-17, much lower p-values and standard errors than
>>> version 1.7-11
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>> s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) -0.0997566 0.0014139 -70.552< 2e-16 ***
>>> RefVratio.z 0.1057049 0.0006565 161.010< 2e-16 ***
>>> IsSemiwordOrDemonstrative 0.2898285 0.0020388 142.155< 2e-16 ***
>>> RefSoundCnt.z 0.1199813 0.0009381 127.905< 2e-16 ***
>>> SpYearBirth.z -0.0113956 0.0006508 -17.509< 2e-16 ***
>>> IsAragon 0.0556777 0.0057143 9.744< 2e-16 ***
>>> PopCntLog_residGeo.z -0.0065037 0.0007938 -8.193 2.58e-16 ***
>>> SpYearBirth.z:IsAragon 0.0158712 0.0014829 10.703< 2e-16 ***
>>> ---
>>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> Approximate significance of smooth terms:
>>> edf Ref.df F p-value
>>> s(GeoX,GeoY) 24.01 24.21 31.15<2e-16 ***
>>> s(Word) 352.29 347.00 587.26<2e-16 ***
>>> s(Key) 269.75 313.00 4246.76<2e-16 ***
>>> ---
>>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) = 0.693 Deviance explained = 69.4%
>>> REML score = -22476 Scale est. = 0.038177 n = 112608
>>>
>>>
>>> #################################
>>> ### mgcv version 1.7-17, default: method="fREML", all p-values
>>> non-significant and negative edf's of s(X,bs="re")
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>> s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) -0.099757 1.730235 -0.058 0.954
>>> RefVratio.z 0.105705 1.145329 0.092 0.926
>>> IsSemiwordOrDemonstrative 0.289828 4.167237 0.070 0.945
>>> RefSoundCnt.z 0.119981 1.901158 0.063 0.950
>>> SpYearBirth.z -0.011396 0.034236 -0.333 0.739
>>> IsAragon 0.055678 0.298629 0.186 0.852
>>> PopCntLog_residGeo.z -0.006504 0.041981 -0.155 0.877
>>> SpYearBirth.z:IsAragon 0.015871 0.077142 0.206 0.837
>>>
>>> Approximate significance of smooth terms:
>>> edf Ref.df F p-value
>>> s(GeoX,GeoY) -1376 1 7.823 0.00516 **
>>> s(Word) -8298 347 577.910< 2e-16 ***
>>> s(Key) -1421 316 13.512< 2e-16 ***
>>> ---
>>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) = 0.741 Deviance explained = 69.4%
>>> fREML score = -22476 Scale est. = 0.038177 n = 112608
>>
>>
>>
>
>
> --
> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> +44 (0)1225 386603 http://people.bath.ac.uk/sw283
More information about the R-help
mailing list