[R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?

Mon Jun 11 10:47:45 CEST 2012

Hi Simon,

Thanks for your reply. That is very helpful.

However, in the logistic regression example, there also appears to be
a large difference in the p-values and estimates when comparing
summary(model) # mgcv 1.7-11
summary(model,freq=F) # mgcv 1.7-17

These should be the same, right? But why is this not the case? The
results are shown below, and are also explained in this post:
http://r.789695.n4.nabble.com/mgcv-bam-very-large-standard-error-difference-between-versions-1-7-11-and-1-7-17-bug-tp4632173p4632491.html)

With kind regards,
Martijn

# call in all versions
model = bam(NoStandard3 ~
te(GeoX,GeoY,WordFreqLog.z,by=OldSpeakerContrast,d=c(2,1)) +
OldSpeakerContrast + PopCntLog.z + s(Word,bs="re") +
s(Placename,bs="re") + s(Word,PopAvgIncomeLog.z,bs="re") +
s(Word,PopAvgAgeLog_residIncome.z,bs="re") +
s(Word,PopCntLog.z,bs="re") +
s(Word,YearRec.z,bs="re"),data=lexdst,family="binomial",method="REML")

### mgcv, version 1.7-11
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
#    d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
#    bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
#    bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
#    s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
#                    Estimate Std. Error z value Pr(>|z|)
#(Intercept)         -0.41610    0.14027  -2.966  0.00301 **
#OldSpeakerContrast1  0.44508    0.01934  23.009  < 2e-16 ***
#PopCntLog.z         -0.11673    0.02725  -4.284 1.83e-05 ***
#---
#Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
#                                                   edf Ref.df  Chi.sq  p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0  54.05  69.06   223.3  < 2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1  59.17  74.83   327.6  < 2e-16
#s(Word)                                         166.41 167.84 13756.0  < 2e-16
#s(Placename)                                    157.64 188.02   692.7  < 2e-16
#s(Word,PopAvgIncomeLog.z)                       137.83 160.05   829.9  < 2e-16
#s(Word,PopAvgAgeLog_residIncome.z)              121.13 151.13   435.5  < 2e-16
#s(Word,PopCntLog.z)                              97.12 133.97   205.5 7.01e-05
#s(Word,YearRec.z)                               128.98 155.27   585.7  < 2e-16
#
#R-sq.(adj) =  0.372   Deviance explained = 32.4%
#REML score =  97450  Scale est. = 1         n = 69259

### mgcv 1.7-17, summary(model,freq=F)
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
#    d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
#    bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
#    bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
#    s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
#                    Estimate Std. Error z value Pr(>|z|)
#(Intercept)         -0.98150    0.52485  -1.870   0.0615 .
#OldSpeakerContrast1  0.65497    0.68628   0.954   0.3399
#PopCntLog.z         -0.11724    0.02726  -4.300 1.71e-05 ***
#---
#Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
#                                                   edf Ref.df  Chi.sq p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0  53.62  68.48   221.4  <2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1  58.64  74.14   324.9  <2e-16
#s(Word)                                         166.41 168.00 14351.7  <2e-16
#s(Placename)                                    157.79 209.00  1160.0  <2e-16
#s(Word,PopAvgIncomeLog.z)                       137.84 170.00  1062.1  <2e-16
#s(Word,PopAvgAgeLog_residIncome.z)              121.15 170.00   686.7  <2e-16
#s(Word,PopCntLog.z)                              97.10 169.00   434.0  <2e-16
#s(Word,YearRec.z)                               128.98 170.00   867.4  <2e-16
#
#R-sq.(adj) =  0.372   Deviance explained = 32.4%
#REML score =  97449  Scale est. = 1         n = 69259

Met vriendelijke groet,
Martijn

--
*******************************************
Martijn Wieling
http://www.martijnwieling.nl
wieling at gmail.com
+31(0)614108622
*******************************************
University of Groningen
http://www.rug.nl/staff/m.b.wieling
*******************************************

On Mon, Jun 11, 2012 at 10:34 AM, Simon Wood <s.wood at bath.ac.uk> wrote:
> Dear Martijn,
>
> You are right that the change in the summary defaults was not a sensible
> thing to do, and I'll change it back. However in the mean time you can use
> 1.7-17 with summary(...,freq=FALSE), when you have random effects in the
> model (it really shouldn't make a statistically meaningful difference
> otherwise).
>
> I made the change because it leads to better p-value performance when
> parametric model component are penalized using gam's `paraPen' argument, but
> as you have demonstrated this then uses a fixed effects distribution that is
> not really appropriate when there are random effects present. I should
> probably just deal with the paraPen case in the paraPen documentation.
>
> Technically the difference here is that with freq=TRUE the covariance matrix
> for the model coefficient estimators/predictions is taken to be
>
>  (X'X+S)^{-1}X'X(X'X+S)^{-1}\sigma^2
>
> where X is the model matrix and S the penalty matrix for the model (Gaussian
> case, non-Gaussian also has weights). This is based on the fact that the
> coefficient estimates/predictions are
>  \hat \beta = (X'X+S)^{-1}X'y
> where y is the response data, and y_i iid N(0,\sigma^2).
>
> When freq=FALSE the covariance matrix is the one that comes from the
> Bayesian/random effects view of the smoothing process, and is
>
>  (X'X+S)^{-1}\sigma^2
>
> The latter is usually what you want when you have true random effects in the
> model (rather than smooths which are usually not in the model as true random
> effects).
>
> best,
> Simon
>
>
>
> On 03/06/12 18:45, Martijn Wieling wrote:
>>
>> Dear useRs,
>>
>> I've ran some additional analyses (see below), which strongly suggest
>> the standard errors of the bam (and gam) function are much too low in
>> mgcv version 1.7-17, at least when including an s(X,bs="re") term.
>> Until this issue has been clarified, it's perhaps best to use an older
>> version of mgcv (unfortunately, however, in earlier versions the
>> p-value calculation of s(X,bs="re") is not correct). All analyses were
>> conducted in R 2.15.0.
>>
>> My approach was the following: I created a mixed-effects regression
>> model with a single random intercept and only linear predictors. In my
>> view, the results using lmer (lme4) should be comparable to those of
>> bam and gam (mgcv). This was the case when using an older version of
>> mgcv (version 1.7-13), but this is not the case anymore in version
>> 1.7-17. In version 1.7-17, the standard errors and p-values are much
>> lower and very similar to those of a linear model (which does not take
>> the random-effects structure into account). The R-code and results are
>> shown below. (The results using gam are not shown, but show the same
>> pattern.)
>>
>> Furthermore, note that the differences in standard errors become less
>> severe (but still noticeable) when less data is involved (e.g., using
>> only 500 rows as opposed to>100.000). Finally, when not including an
>> s(X,bs="re") term, but another non-random-effect smooth, the standard
>> errors do not appear to be structurally lower (only for some
>> variables, but not by a great deal - see also below).
>>
>> With kind regards,
>> Martijn Wieling
>> University of Groningen
>>
>> #### lme4 model (most recent version of lme4)
>> modelLMER<- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + (1|Key), data=wrddst)
>> #                        Estimate Std. Error t value
>> #SpYearBirth.z          -0.012084   0.004577  -2.640
>> #IsAragon                0.138959   0.010040  13.840
>> #SpIsMale               -0.003087   0.008290  -0.372
>> #SpYearBirth.z:IsAragon  0.015429   0.010159   1.519
>>
>>
>> #### mgcv 1.7-13, default (method = "REML") - almost identical to
>> modelLMER
>> modelBAMold<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst)
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #SpYearBirth.z          -0.012084   0.004578  -2.640  0.00829 **
>> #IsAragon                0.138959   0.010042  13.838<  2e-16 ***
>> #SpIsMale               -0.003087   0.008292  -0.372  0.70968
>> #SpYearBirth.z:IsAragon  0.015429   0.010160   1.519  0.12886
>>
>>
>> #### mgcv 1.7-17, method = "REML" - standard errors greatly reduced
>> # (comparable to standard errors of LM without random intercept)
>> modelBAMnew<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #SpYearBirth.z          -0.012084   0.001159 -10.428<  2e-16 ***
>> #IsAragon                0.138959   0.002551  54.472<  2e-16 ***
>> #SpIsMale               -0.003087   0.002098  -1.471    0.141
>> #SpYearBirth.z:IsAragon  0.015429   0.002587   5.965 2.45e-09 ***
>>
>> #### lm results, standard errors comparable to mgcv 1.7-17
>> modelLM<- lm(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon + SpIsMale,
>> data=wrddst)
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #(Intercept)            -0.025779   0.001653 -15.595<  2e-16 ***
>> #SpYearBirth.z          -0.011906   0.001182 -10.070<  2e-16 ***
>> #IsAragon                0.139323   0.002603  53.531<  2e-16 ***
>> #SpIsMale               -0.003076   0.002140  -1.437    0.151
>> #SpYearBirth.z:IsAragon  0.015252   0.002639   5.780 7.49e-09 ***
>>
>>
>> #### mgcv 1.7-17, default (method = "fREML") - completely different
>> from previous models
>> modelBAMfREML<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
>> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #(Intercept)            -0.025391   0.106897  -0.238    0.812
>> #SpYearBirth.z          -0.012084   0.076300  -0.158    0.874
>> #IsAragon                0.138959   0.166697   0.834    0.405
>> #SpIsMale               -0.003087   0.138291  -0.022    0.982
>> #SpYearBirth.z:IsAragon  0.015429   0.168260   0.092    0.927
>> #
>> #Approximate significance of smooth terms:
>> #          edf Ref.df     F p-value
>> #s(Key) -38.95    310 15.67<2e-16 ***
>>
>>
>> #### differences w.r.t. standard smooths
>> #### mgcv version 1.7-13
>> m2old<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
>> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
>> ## RESULTS
>> #Family: gaussian
>> #Link function: identity
>> #
>> #Formula:
>> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
>> #    SpIsMale
>> #
>> #Parametric coefficients:
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #(Intercept)            -0.001386   0.004982  -0.278   0.7809
>> #SpYearBirth.z          -0.012950   0.001167 -11.097<  2e-16 ***
>> #IsAragon                0.020532   0.023608   0.870   0.3845
>> #SpIsMale               -0.004788   0.002219  -2.158   0.0309 *
>> #SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
>> #---
>> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #Approximate significance of smooth terms:
>> #               edf Ref.df     F p-value
>> #s(GeoX,GeoY) 27.11  28.14 126.2<2e-16 ***
>> #---
>> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #R-sq.(adj) =  0.0555   Deviance explained = 5.58%
>> #REML score =  39232  Scale est. = 0.11734   n = 112608
>>
>>
>> #### mgcv version 1.7-17
>> m2new<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
>> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
>> #Family: gaussian
>> #Link function: identity
>> #
>> #Formula:
>> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
>> #    SpIsMale
>> #
>> #Parametric coefficients:
>> #                        Estimate Std. Error t value Pr(>|t|)
>> #(Intercept)            -0.001388   0.003938  -0.352   0.7245
>> #SpYearBirth.z          -0.012950   0.001167 -11.098<  2e-16 ***
>> #IsAragon                0.020543   0.018055   1.138   0.2552
>> #SpIsMale               -0.004788   0.002215  -2.161   0.0307 *
>> #SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
>> #---
>> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #Approximate significance of smooth terms:
>> #               edf Ref.df     F p-value
>> #s(GeoX,GeoY) 27.11  28.14 126.2<2e-16 ***
>> #---
>> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> #
>> #R-sq.(adj) =  0.0555   Deviance explained = 5.58%
>> #REML score =  39232  Scale est. = 0.11734   n = 112608
>>
>>
>> On Sat, Jun 2, 2012 at 6:25 PM, Martijn Wieling<wieling at gmail.com>  wrote:
>>>
>>> Dear useRs,
>>>
>>> I reran an analysis with bam (mgcv, version 1.7-17) originally
>>> conducted using an older version of bam (mgcv, version 1.7-11) and
>>> this resulted in the same estimates, but much lower standard errors
>>> (in some cases 20 times as low) and lower p-values. This obviously
>>> results in a larger set of significant predictors. Is this result
>>> expected given the improvements in the new version? Or this a bug and
>>> are the p-values of bam in mgcv 1.7-17 too low? The summaries of both
>>> versions are shown below to enable a comparison.
>>>
>>> In addition, applying the default method="fREML" (mgcv version 1.7-17)
>>> on the same dataset yields only non-significant results, while all
>>> results are highly significant using method="REML". Furthermore, it
>>> also results in large negative (e.g., -8757) edf values linked to
>>> s(X,bs="RE") terms. Is this correct, or is this a bug? The summary of
>>> the model using method="fREML" is also shown below.
>>>
>>> I hope someone can shed some light on this.
>>>
>>> With kind regards,
>>> Martijn Wieling,
>>> University of Groningen
>>>
>>> #################################
>>> ### mgcv version 1.7-11
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>>    s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>>                           Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)               -0.099757   0.020234  -4.930 8.23e-07 ***
>>> RefVratio.z                0.105705   0.013328   7.931 2.19e-15 ***
>>> IsSemiwordOrDemonstrative  0.289828   0.046413   6.245 4.27e-10 ***
>>> RefSoundCnt.z              0.119981   0.021202   5.659 1.53e-08 ***
>>> SpYearBirth.z             -0.011396   0.002407  -4.734 2.21e-06 ***
>>> IsAragon                   0.055678   0.033137   1.680  0.09291 .
>>> PopCntLog_residGeo.z      -0.006504   0.003279  -1.984  0.04731 *
>>> SpYearBirth.z:IsAragon     0.015871   0.005459   2.907  0.00365 **
>>> ---
>>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> Approximate significance of smooth terms:
>>>                edf Ref.df      F p-value
>>> s(GeoX,GeoY)  24.01  24.21  31.16<2e-16 ***
>>> s(Word)      352.29 347.00 501.57<2e-16 ***
>>> s(Key)       269.75 289.25  10.76<2e-16 ***
>>> ---
>>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) =  0.693   Deviance explained = 69.4%
>>> REML score = -22476  Scale est. = 0.038177  n = 112608
>>>
>>>
>>> #################################
>>> ### mgcv version 1.7-17, much lower p-values and standard errors than
>>> version 1.7-11
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>>    s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>>                            Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)               -0.0997566  0.0014139 -70.552<  2e-16 ***
>>> RefVratio.z                0.1057049  0.0006565 161.010<  2e-16 ***
>>> IsSemiwordOrDemonstrative  0.2898285  0.0020388 142.155<  2e-16 ***
>>> RefSoundCnt.z              0.1199813  0.0009381 127.905<  2e-16 ***
>>> SpYearBirth.z             -0.0113956  0.0006508 -17.509<  2e-16 ***
>>> IsAragon                   0.0556777  0.0057143   9.744<  2e-16 ***
>>> PopCntLog_residGeo.z      -0.0065037  0.0007938  -8.193 2.58e-16 ***
>>> SpYearBirth.z:IsAragon     0.0158712  0.0014829  10.703<  2e-16 ***
>>> ---
>>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> Approximate significance of smooth terms:
>>>                edf Ref.df       F p-value
>>> s(GeoX,GeoY)  24.01  24.21   31.15<2e-16 ***
>>> s(Word)      352.29 347.00  587.26<2e-16 ***
>>> s(Key)       269.75 313.00 4246.76<2e-16 ***
>>> ---
>>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) =  0.693   Deviance explained = 69.4%
>>> REML score = -22476  Scale est. = 0.038177  n = 112608
>>>
>>>
>>> #################################
>>> ### mgcv version 1.7-17, default: method="fREML", all p-values
>>> non-significant and negative edf's of s(X,bs="re")
>>> #################################
>>>
>>> Family: gaussian
>>> Link function: identity
>>>
>>> Formula:
>>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
>>> IsSemiwordOrDemonstrative +
>>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>>    s(Word, bs = "re") + s(Key, bs = "re")
>>>
>>> Parametric coefficients:
>>>                           Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)               -0.099757   1.730235  -0.058    0.954
>>> RefVratio.z                0.105705   1.145329   0.092    0.926
>>> IsSemiwordOrDemonstrative  0.289828   4.167237   0.070    0.945
>>> RefSoundCnt.z              0.119981   1.901158   0.063    0.950
>>> SpYearBirth.z             -0.011396   0.034236  -0.333    0.739
>>> IsAragon                   0.055678   0.298629   0.186    0.852
>>> PopCntLog_residGeo.z      -0.006504   0.041981  -0.155    0.877
>>> SpYearBirth.z:IsAragon     0.015871   0.077142   0.206    0.837
>>>
>>> Approximate significance of smooth terms:
>>>               edf Ref.df       F p-value
>>> s(GeoX,GeoY) -1376      1   7.823 0.00516 **
>>> s(Word)      -8298    347 577.910<  2e-16 ***
>>> s(Key)       -1421    316  13.512<  2e-16 ***
>>> ---
>>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>>
>>> R-sq.(adj) =  0.741   Deviance explained = 69.4%
>>> fREML score = -22476  Scale est. = 0.038177  n = 112608
>>
>>
>>
>
>
> --
> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> +44 (0)1225 386603               http://people.bath.ac.uk/sw283