[R] mgcv: inclusion of random intercept in model - based on p-value of smooth or anova?

Mon Jun 11 19:52:18 CEST 2012

Hi Martijn,

Irrespective of the p-value, 'bam' and 'lmer' agree that the variance 
component for 'Placename' is practically zero. In the 'bam' output see 
the 'edf' for s(Placename), or for a more direct comparison call 
gam.vcomp(m1).

As mentioned in ?summary.gam the p-values for "re" terms are fairly 
crude (and certainly don't solve all the usual problems with testing 
variance components for equality to zero), so I would not take them too 
seriously for testing whether your random effect is exactly zero, when 
the estimates/predictions are this close to zero (the typical random 
effect size for Placename is about 1e-8, after all). That said, when I 
randomly re-shuffle Placename, so that the null hypothesis is true, then 
the p-value distribution does look uniform, as it should, despite some 
edfs even smaller than that for the original data.... So the low p-value 
may simply reflect the common problem that even very tiny effects are 
often statistically significant at high sample sizes, I suppose. Anyway, 
unless effects of size 1e-8 are meaningful here, I would drop the 
Placename term.

best,
Simon

ps. I'm not sure what effect the rather heavy tails on the residuals may 
be having here?

On 11/06/12 14:56, Martijn Wieling wrote:
> Dear Simon,
>
> I ran an additional analysis using bam (mgcv 1.7-17) with three random
> intercepts and no non-linearities, and compared these to the results
> of lmer (lme4). Using bam results in a significant random intercept
> (even though it has a very low edf-value), while the lmer results show
> no variance associated to the random intercept of Placename. Should I
> drop the random intercept of Placename and if so, how is this apparent
> from the results of bam?
>
> Summaries of both models are shown below.
>
> With kind regards,
> Martijn
>
> #### l1 = lmer(RefPMIdistMeanLog.c ~ geogamfit + (1|Word) + (1|Key) +
> (1|Placename), data=wrddst); print(l1,cor=F)
>
> Linear mixed model fit by REML
> Formula: RefPMIdistMeanLog.c ~ geogamfit + (1 | Word) + (1 | Key) + (1
> |      Placename)
>     Data: wrddst
>      AIC    BIC logLik deviance REMLdev
>   -44985 -44927  22498   -45009  -44997
> Random effects:
>   Groups    Name        Variance  Std.Dev.
>   Word      (Intercept) 0.0800944 0.283009
>   Key       (Intercept) 0.0013641 0.036933
>   Placename (Intercept) 0.0000000 0.000000
>   Residual              0.0381774 0.195390
> Number of obs: 112608, groups: Word, 357; Key, 320; Placename, 40
>
> Fixed effects:
>              Estimate Std. Error t value
> (Intercept) -0.00342    0.01513   -0.23
> geogamfit    0.99249    0.02612   37.99
>
>
> #### m1 = bam(RefPMIdistMeanLog.c ~ geogamfit + s(Word,bs="re") +
> s(Key,bs="re") + s(Placename,bs="re"), data=wrddst,method="REML");
> summary(m1,freq=F)
>
> Family: gaussian
> Link function: identity
>
> Formula:
> RefPMIdistMeanLog.c ~ geogamfit + s(Word, bs = "re") + s(Key,
>      bs = "re") + s(Placename, bs = "re")
>
> Parametric coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept) -0.00342    0.01513  -0.226    0.821
> geogamfit    0.99249    0.02612  37.991<2e-16 ***
> ---
> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>
> Approximate significance of smooth terms:
>                     edf Ref.df       F p-value
> s(Word)      3.554e+02    347 634.716<2e-16 ***
> s(Key)       2.946e+02    316  23.054<2e-16 ***
> s(Placename) 1.489e-04     38   7.282<2e-16 ***
> ---
> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>
> R-sq.(adj) =  0.693   Deviance explained = 69.4%
> REML score = -22498  Scale est. = 0.038177  n = 112608
>
>
> On Wed, May 23, 2012 at 11:30 AM, Simon Wood-4 [via R]
> <ml-node+s789695n4631060h52 at n4.nabble.com>  wrote:
>> Having looked at this further, I've made some changes in mgcv_1.7-17 to
>> the p-value computations for terms that can be penalized to zero during
>> fitting (e.g. s(x,bs="re"), s(x,m=1) etc).
>>
>> The Wald statistic based p-values from summary.gam and anova.gam (i.e.
>> what you get from e.g. anova(a) where a is a fitted gam object) are
>> quite well founded for smooth terms that are non-zero under full
>> penalization (e.g. a cubic spline is a straight line under full
>> penalization). For such smooths, an extension of Nychka's (1988) result
>> on CI's for splines gives a well founded distributional result on which
>> to base a Wald statistic. However, the Nychka result requires the
>> smoothing bias to be substantially less than the smoothing estimator
>> variance, and this will often not be the case if smoothing can actually
>> penalize a term to zero (to understand why, see argument in appendix of
>> Marra&  Wood, 2012, Scandinavian Journal of Statistics, 39,53-74).
>>
>> Simulation testing shows that this theoretical concern has serious
>> practical consequences. So for terms that can be penalized to zero,
>> alternative approximations have to be used, and these are now
>> implemented in mgcv_1.7-17 (see ?summary.gam).
>>
>> The approximate test performed by anova(a,b) (a and b are fitted "gam"
>> objects) is less well founded. It is a reasonable approximation when
>> each smooth term in the models could in principle be well approximated
>> by an unpenalized term of rank approximately equal to the edf of the
>> smooth term, but otherwise the p-values produced are likely to be much
>> too small. In particular simulation testing suggests that the test is
>> not to be trusted with s(...,bs="re") terms, and can be poor if the
>> models being compared involve any terms that can be penalized to zero
>> during fitting. (Although the mechanisms are a little different, this is
>> similar to the problem we would have if the models were viewed as
>> regular mixed models and we tried to use a GLRT to test variance
>> components for equality to zero).
>>
>> These issues are now documented in ?anova.gam and ?summary.gam...
>>
>> Simon
>>
>> On 08/05/12 15:01, Martijn Wieling wrote:
>>
>>> Dear useRs,
>>>
>>> I am using mgcv version 1.7-16. When I create a model with a few
>>> non-linear terms and a random intercept for (in my case) country using
>>> s(Country,bs="re"), the representative line in my model (i.e.
>>> approximate significance of smooth terms) for the random intercept
>>> reads:
>>>                           edf       Ref.df     F          p-value
>>> s(Country)       36.127 58.551   0.644    0.982
>>>
>>> Can I interpret this as there being no support for a random intercept
>>> for country? However, when I compare the simpler model to the model
>>> including the random intercept, the latter appears to be a significant
>>> improvement.
>>>
>>>> anova(gam1,gam2,test="F")
>>> Model 1: ....
>>> Model 2: .... + s(BirthNation, bs="re")
>>>     Resid. Df Resid. Dev     Df Deviance      F    Pr(>F)
>>> 1    789.44     416.54
>>> 2    753.15     373.54 36.292   43.003 2.3891 1.225e-05 ***
>>>
>>> I hope somebody could help me in how I should proceed in these
>>> situations. Do I include the random intercept or not?
>>>
>>> I also have a related question. When I used to create a mixed-effects
>>> regression model using lmer and included e.g., an interaction in the
>>> fixed-effects structure, I would test if the inclusion of this
>>> interaction was warranted using anova(lmer1,lmer2). It then would show
>>> me that I invested 1 additional df and the resulting (possibly
>>> significant) improvement in fit of my model.
>>>
>>> This approach does not seem to work when using gam. In this case an
>>> apparent investment of 1 degree of freedom for the interaction, might
>>> result in an actual decrease of the degrees of freedom invested by the
>>> total model (caused by a decrease of the edf's of splines in the model
>>> with the interaction). In this case, how would I proceed in
>>> determining if the model including the interaction term is better?
>>>
>>> With kind regards,
>>> Martijn Wieling
>>>
>>> --
>>> *******************************************
>>> Martijn Wieling
>>> http://www.martijnwieling.nl
>>> [hidden email]
>>> +31(0)614108622
>>> *******************************************
>>> University of Groningen
>>> http://www.rug.nl/staff/m.b.wieling
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>> --
>> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
>> +44 (0)1225 386603               http://people.bath.ac.uk/sw283
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> ________________________________
>> If you reply to this email, your message will be added to the discussion
>> below:
>> http://r.789695.n4.nabble.com/mgcv-inclusion-of-random-intercept-in-model-based-on-p-value-of-smooth-or-anova-tp4617585p4631060.html
>> To unsubscribe from mgcv: inclusion of random intercept in model - based on
>> p-value of smooth or anova?, click here.
>> NAML
>

-- 
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603               http://people.bath.ac.uk/sw283