[R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?
Martijn Wieling
wieling at gmail.com
Tue Jun 5 11:06:28 CEST 2012
Dear useRs, Simon,
@Simon: I'll send the data offline.
The command summary(...,freq=F) yields the same result as before. Why
did the default change from freq=F in version 1.7-13 to freq=T in
version 1.7-17? Especially since the original default appeared to be
better.
I noticed that some of my commands were slightly off in the R-example
of my previous e-mail (cut-paste errors...). They should be:
modelLMER <- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + (1|Key), data=wrddst)
print(modelLMER,cor=F)
# using version 1.7-13, default = "REML"
modelBAMold <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst)
summary(modelBAMold)
# using version 1.7-17, explicitly stating method="REML"
modelBAMnew <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst, method="REML")
summary(modelBAMnew)
summary(modelBAMnew,freq=F) # same results as modelBAMold
# using version 1.7-17, default: fREML - contains a bug
modelBAMfREML <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst)
summary(modelBAMfREML)
With kind regards,
Martijn Wieling
--
*******************************************
Martijn Wieling
http://www.martijnwieling.nl
wieling at gmail.com
+31(0)614108622
*******************************************
University of Groningen
http://www.rug.nl/staff/m.b.wieling
*******************************************
On Sun, Jun 3, 2012 at 7:45 PM, Martijn Wieling <wieling at gmail.com> wrote:
> Dear useRs,
>
> I've ran some additional analyses (see below), which strongly suggest
> the standard errors of the bam (and gam) function are much too low in
> mgcv version 1.7-17, at least when including an s(X,bs="re") term.
> Until this issue has been clarified, it's perhaps best to use an older
> version of mgcv (unfortunately, however, in earlier versions the
> p-value calculation of s(X,bs="re") is not correct). All analyses were
> conducted in R 2.15.0.
>
> My approach was the following: I created a mixed-effects regression
> model with a single random intercept and only linear predictors. In my
> view, the results using lmer (lme4) should be comparable to those of
> bam and gam (mgcv). This was the case when using an older version of
> mgcv (version 1.7-13), but this is not the case anymore in version
> 1.7-17. In version 1.7-17, the standard errors and p-values are much
> lower and very similar to those of a linear model (which does not take
> the random-effects structure into account). The R-code and results are
> shown below. (The results using gam are not shown, but show the same
> pattern.)
>
> Furthermore, note that the differences in standard errors become less
> severe (but still noticeable) when less data is involved (e.g., using
> only 500 rows as opposed to >100.000). Finally, when not including an
> s(X,bs="re") term, but another non-random-effect smooth, the standard
> errors do not appear to be structurally lower (only for some
> variables, but not by a great deal - see also below).
>
> With kind regards,
> Martijn Wieling
> University of Groningen
>
> #### lme4 model (most recent version of lme4)
> modelLMER <- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + (1|Key), data=wrddst)
> # Estimate Std. Error t value
> #SpYearBirth.z -0.012084 0.004577 -2.640
> #IsAragon 0.138959 0.010040 13.840
> #SpIsMale -0.003087 0.008290 -0.372
> #SpYearBirth.z:IsAragon 0.015429 0.010159 1.519
>
>
> #### mgcv 1.7-13, default (method = "REML") - almost identical to modelLMER
> modelBAMold <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst)
> # Estimate Std. Error t value Pr(>|t|)
> #SpYearBirth.z -0.012084 0.004578 -2.640 0.00829 **
> #IsAragon 0.138959 0.010042 13.838 < 2e-16 ***
> #SpIsMale -0.003087 0.008292 -0.372 0.70968
> #SpYearBirth.z:IsAragon 0.015429 0.010160 1.519 0.12886
>
>
> #### mgcv 1.7-17, method = "REML" - standard errors greatly reduced
> # (comparable to standard errors of LM without random intercept)
> modelBAMnew <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
> # Estimate Std. Error t value Pr(>|t|)
> #SpYearBirth.z -0.012084 0.001159 -10.428 < 2e-16 ***
> #IsAragon 0.138959 0.002551 54.472 < 2e-16 ***
> #SpIsMale -0.003087 0.002098 -1.471 0.141
> #SpYearBirth.z:IsAragon 0.015429 0.002587 5.965 2.45e-09 ***
>
> #### lm results, standard errors comparable to mgcv 1.7-17
> modelLM <- lm(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon + SpIsMale,
> data=wrddst)
> # Estimate Std. Error t value Pr(>|t|)
> #(Intercept) -0.025779 0.001653 -15.595 < 2e-16 ***
> #SpYearBirth.z -0.011906 0.001182 -10.070 < 2e-16 ***
> #IsAragon 0.139323 0.002603 53.531 < 2e-16 ***
> #SpIsMale -0.003076 0.002140 -1.437 0.151
> #SpYearBirth.z:IsAragon 0.015252 0.002639 5.780 7.49e-09 ***
>
>
> #### mgcv 1.7-17, default (method = "fREML") - completely different
> from previous models
> modelBAMfREML <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
> # Estimate Std. Error t value Pr(>|t|)
> #(Intercept) -0.025391 0.106897 -0.238 0.812
> #SpYearBirth.z -0.012084 0.076300 -0.158 0.874
> #IsAragon 0.138959 0.166697 0.834 0.405
> #SpIsMale -0.003087 0.138291 -0.022 0.982
> #SpYearBirth.z:IsAragon 0.015429 0.168260 0.092 0.927
> #
> #Approximate significance of smooth terms:
> # edf Ref.df F p-value
> #s(Key) -38.95 310 15.67 <2e-16 ***
>
>
> #### differences w.r.t. standard smooths
> #### mgcv version 1.7-13
> m2old <- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
> ## RESULTS
> #Family: gaussian
> #Link function: identity
> #
> #Formula:
> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
> # SpIsMale
> #
> #Parametric coefficients:
> # Estimate Std. Error t value Pr(>|t|)
> #(Intercept) -0.001386 0.004982 -0.278 0.7809
> #SpYearBirth.z -0.012950 0.001167 -11.097 < 2e-16 ***
> #IsAragon 0.020532 0.023608 0.870 0.3845
> #SpIsMale -0.004788 0.002219 -2.158 0.0309 *
> #SpYearBirth.z:IsAragon 0.015611 0.002600 6.005 1.92e-09 ***
> #---
> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #Approximate significance of smooth terms:
> # edf Ref.df F p-value
> #s(GeoX,GeoY) 27.11 28.14 126.2 <2e-16 ***
> #---
> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #R-sq.(adj) = 0.0555 Deviance explained = 5.58%
> #REML score = 39232 Scale est. = 0.11734 n = 112608
>
>
> #### mgcv version 1.7-17
> m2new <- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
> #Family: gaussian
> #Link function: identity
> #
> #Formula:
> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
> # SpIsMale
> #
> #Parametric coefficients:
> # Estimate Std. Error t value Pr(>|t|)
> #(Intercept) -0.001388 0.003938 -0.352 0.7245
> #SpYearBirth.z -0.012950 0.001167 -11.098 < 2e-16 ***
> #IsAragon 0.020543 0.018055 1.138 0.2552
> #SpIsMale -0.004788 0.002215 -2.161 0.0307 *
> #SpYearBirth.z:IsAragon 0.015611 0.002600 6.005 1.92e-09 ***
> #---
> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #Approximate significance of smooth terms:
> # edf Ref.df F p-value
> #s(GeoX,GeoY) 27.11 28.14 126.2 <2e-16 ***
> #---
> #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #R-sq.(adj) = 0.0555 Deviance explained = 5.58%
> #REML score = 39232 Scale est. = 0.11734 n = 112608
>
>
> On Sat, Jun 2, 2012 at 6:25 PM, Martijn Wieling <wieling at gmail.com> wrote:
>> Dear useRs,
>>
>> I reran an analysis with bam (mgcv, version 1.7-17) originally
>> conducted using an older version of bam (mgcv, version 1.7-11) and
>> this resulted in the same estimates, but much lower standard errors
>> (in some cases 20 times as low) and lower p-values. This obviously
>> results in a larger set of significant predictors. Is this result
>> expected given the improvements in the new version? Or this a bug and
>> are the p-values of bam in mgcv 1.7-17 too low? The summaries of both
>> versions are shown below to enable a comparison.
>>
>> In addition, applying the default method="fREML" (mgcv version 1.7-17)
>> on the same dataset yields only non-significant results, while all
>> results are highly significant using method="REML". Furthermore, it
>> also results in large negative (e.g., -8757) edf values linked to
>> s(X,bs="RE") terms. Is this correct, or is this a bug? The summary of
>> the model using method="fREML" is also shown below.
>>
>> I hope someone can shed some light on this.
>>
>> With kind regards,
>> Martijn Wieling,
>> University of Groningen
>>
>> #################################
>> ### mgcv version 1.7-11
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>> s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) -0.099757 0.020234 -4.930 8.23e-07 ***
>> RefVratio.z 0.105705 0.013328 7.931 2.19e-15 ***
>> IsSemiwordOrDemonstrative 0.289828 0.046413 6.245 4.27e-10 ***
>> RefSoundCnt.z 0.119981 0.021202 5.659 1.53e-08 ***
>> SpYearBirth.z -0.011396 0.002407 -4.734 2.21e-06 ***
>> IsAragon 0.055678 0.033137 1.680 0.09291 .
>> PopCntLog_residGeo.z -0.006504 0.003279 -1.984 0.04731 *
>> SpYearBirth.z:IsAragon 0.015871 0.005459 2.907 0.00365 **
>> ---
>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> Approximate significance of smooth terms:
>> edf Ref.df F p-value
>> s(GeoX,GeoY) 24.01 24.21 31.16 <2e-16 ***
>> s(Word) 352.29 347.00 501.57 <2e-16 ***
>> s(Key) 269.75 289.25 10.76 <2e-16 ***
>> ---
>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) = 0.693 Deviance explained = 69.4%
>> REML score = -22476 Scale est. = 0.038177 n = 112608
>>
>>
>> #################################
>> ### mgcv version 1.7-17, much lower p-values and standard errors than
>> version 1.7-11
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>> s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) -0.0997566 0.0014139 -70.552 < 2e-16 ***
>> RefVratio.z 0.1057049 0.0006565 161.010 < 2e-16 ***
>> IsSemiwordOrDemonstrative 0.2898285 0.0020388 142.155 < 2e-16 ***
>> RefSoundCnt.z 0.1199813 0.0009381 127.905 < 2e-16 ***
>> SpYearBirth.z -0.0113956 0.0006508 -17.509 < 2e-16 ***
>> IsAragon 0.0556777 0.0057143 9.744 < 2e-16 ***
>> PopCntLog_residGeo.z -0.0065037 0.0007938 -8.193 2.58e-16 ***
>> SpYearBirth.z:IsAragon 0.0158712 0.0014829 10.703 < 2e-16 ***
>> ---
>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> Approximate significance of smooth terms:
>> edf Ref.df F p-value
>> s(GeoX,GeoY) 24.01 24.21 31.15 <2e-16 ***
>> s(Word) 352.29 347.00 587.26 <2e-16 ***
>> s(Key) 269.75 313.00 4246.76 <2e-16 ***
>> ---
>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) = 0.693 Deviance explained = 69.4%
>> REML score = -22476 Scale est. = 0.038177 n = 112608
>>
>>
>> #################################
>> ### mgcv version 1.7-17, default: method="fREML", all p-values
>> non-significant and negative edf's of s(X,bs="re")
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>> RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>> s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) -0.099757 1.730235 -0.058 0.954
>> RefVratio.z 0.105705 1.145329 0.092 0.926
>> IsSemiwordOrDemonstrative 0.289828 4.167237 0.070 0.945
>> RefSoundCnt.z 0.119981 1.901158 0.063 0.950
>> SpYearBirth.z -0.011396 0.034236 -0.333 0.739
>> IsAragon 0.055678 0.298629 0.186 0.852
>> PopCntLog_residGeo.z -0.006504 0.041981 -0.155 0.877
>> SpYearBirth.z:IsAragon 0.015871 0.077142 0.206 0.837
>>
>> Approximate significance of smooth terms:
>> edf Ref.df F p-value
>> s(GeoX,GeoY) -1376 1 7.823 0.00516 **
>> s(Word) -8298 347 577.910 < 2e-16 ***
>> s(Key) -1421 316 13.512 < 2e-16 ***
>> ---
>> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) = 0.741 Deviance explained = 69.4%
>> fREML score = -22476 Scale est. = 0.038177 n = 112608
More information about the R-help
mailing list