[R] mgcv: inclusion of random intercept in model - based on p-value of smooth or anova?
Simon Wood
s.wood at bath.ac.uk
Mon Jun 11 19:52:18 CEST 2012
Hi Martijn,
Irrespective of the p-value, 'bam' and 'lmer' agree that the variance
component for 'Placename' is practically zero. In the 'bam' output see
the 'edf' for s(Placename), or for a more direct comparison call
gam.vcomp(m1).
As mentioned in ?summary.gam the p-values for "re" terms are fairly
crude (and certainly don't solve all the usual problems with testing
variance components for equality to zero), so I would not take them too
seriously for testing whether your random effect is exactly zero, when
the estimates/predictions are this close to zero (the typical random
effect size for Placename is about 1e-8, after all). That said, when I
randomly re-shuffle Placename, so that the null hypothesis is true, then
the p-value distribution does look uniform, as it should, despite some
edfs even smaller than that for the original data.... So the low p-value
may simply reflect the common problem that even very tiny effects are
often statistically significant at high sample sizes, I suppose. Anyway,
unless effects of size 1e-8 are meaningful here, I would drop the
Placename term.
best,
Simon
ps. I'm not sure what effect the rather heavy tails on the residuals may
be having here?
On 11/06/12 14:56, Martijn Wieling wrote:
> Dear Simon,
>
> I ran an additional analysis using bam (mgcv 1.7-17) with three random
> intercepts and no non-linearities, and compared these to the results
> of lmer (lme4). Using bam results in a significant random intercept
> (even though it has a very low edf-value), while the lmer results show
> no variance associated to the random intercept of Placename. Should I
> drop the random intercept of Placename and if so, how is this apparent
> from the results of bam?
>
> Summaries of both models are shown below.
>
> With kind regards,
> Martijn
>
> #### l1 = lmer(RefPMIdistMeanLog.c ~ geogamfit + (1|Word) + (1|Key) +
> (1|Placename), data=wrddst); print(l1,cor=F)
>
> Linear mixed model fit by REML
> Formula: RefPMIdistMeanLog.c ~ geogamfit + (1 | Word) + (1 | Key) + (1
> | Placename)
> Data: wrddst
> AIC BIC logLik deviance REMLdev
> -44985 -44927 22498 -45009 -44997
> Random effects:
> Groups Name Variance Std.Dev.
> Word (Intercept) 0.0800944 0.283009
> Key (Intercept) 0.0013641 0.036933
> Placename (Intercept) 0.0000000 0.000000
> Residual 0.0381774 0.195390
> Number of obs: 112608, groups: Word, 357; Key, 320; Placename, 40
>
> Fixed effects:
> Estimate Std. Error t value
> (Intercept) -0.00342 0.01513 -0.23
> geogamfit 0.99249 0.02612 37.99
>
>
> #### m1 = bam(RefPMIdistMeanLog.c ~ geogamfit + s(Word,bs="re") +
> s(Key,bs="re") + s(Placename,bs="re"), data=wrddst,method="REML");
> summary(m1,freq=F)
>
> Family: gaussian
> Link function: identity
>
> Formula:
> RefPMIdistMeanLog.c ~ geogamfit + s(Word, bs = "re") + s(Key,
> bs = "re") + s(Placename, bs = "re")
>
> Parametric coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -0.00342 0.01513 -0.226 0.821
> geogamfit 0.99249 0.02612 37.991<2e-16 ***
> ---
> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>
> Approximate significance of smooth terms:
> edf Ref.df F p-value
> s(Word) 3.554e+02 347 634.716<2e-16 ***
> s(Key) 2.946e+02 316 23.054<2e-16 ***
> s(Placename) 1.489e-04 38 7.282<2e-16 ***
> ---
> Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>
> R-sq.(adj) = 0.693 Deviance explained = 69.4%
> REML score = -22498 Scale est. = 0.038177 n = 112608
>
>
> On Wed, May 23, 2012 at 11:30 AM, Simon Wood-4 [via R]
> <ml-node+s789695n4631060h52 at n4.nabble.com> wrote:
>> Having looked at this further, I've made some changes in mgcv_1.7-17 to
>> the p-value computations for terms that can be penalized to zero during
>> fitting (e.g. s(x,bs="re"), s(x,m=1) etc).
>>
>> The Wald statistic based p-values from summary.gam and anova.gam (i.e.
>> what you get from e.g. anova(a) where a is a fitted gam object) are
>> quite well founded for smooth terms that are non-zero under full
>> penalization (e.g. a cubic spline is a straight line under full
>> penalization). For such smooths, an extension of Nychka's (1988) result
>> on CI's for splines gives a well founded distributional result on which
>> to base a Wald statistic. However, the Nychka result requires the
>> smoothing bias to be substantially less than the smoothing estimator
>> variance, and this will often not be the case if smoothing can actually
>> penalize a term to zero (to understand why, see argument in appendix of
>> Marra& Wood, 2012, Scandinavian Journal of Statistics, 39,53-74).
>>
>> Simulation testing shows that this theoretical concern has serious
>> practical consequences. So for terms that can be penalized to zero,
>> alternative approximations have to be used, and these are now
>> implemented in mgcv_1.7-17 (see ?summary.gam).
>>
>> The approximate test performed by anova(a,b) (a and b are fitted "gam"
>> objects) is less well founded. It is a reasonable approximation when
>> each smooth term in the models could in principle be well approximated
>> by an unpenalized term of rank approximately equal to the edf of the
>> smooth term, but otherwise the p-values produced are likely to be much
>> too small. In particular simulation testing suggests that the test is
>> not to be trusted with s(...,bs="re") terms, and can be poor if the
>> models being compared involve any terms that can be penalized to zero
>> during fitting. (Although the mechanisms are a little different, this is
>> similar to the problem we would have if the models were viewed as
>> regular mixed models and we tried to use a GLRT to test variance
>> components for equality to zero).
>>
>> These issues are now documented in ?anova.gam and ?summary.gam...
>>
>> Simon
>>
>> On 08/05/12 15:01, Martijn Wieling wrote:
>>
>>> Dear useRs,
>>>
>>> I am using mgcv version 1.7-16. When I create a model with a few
>>> non-linear terms and a random intercept for (in my case) country using
>>> s(Country,bs="re"), the representative line in my model (i.e.
>>> approximate significance of smooth terms) for the random intercept
>>> reads:
>>> edf Ref.df F p-value
>>> s(Country) 36.127 58.551 0.644 0.982
>>>
>>> Can I interpret this as there being no support for a random intercept
>>> for country? However, when I compare the simpler model to the model
>>> including the random intercept, the latter appears to be a significant
>>> improvement.
>>>
>>>> anova(gam1,gam2,test="F")
>>> Model 1: ....
>>> Model 2: .... + s(BirthNation, bs="re")
>>> Resid. Df Resid. Dev Df Deviance F Pr(>F)
>>> 1 789.44 416.54
>>> 2 753.15 373.54 36.292 43.003 2.3891 1.225e-05 ***
>>>
>>> I hope somebody could help me in how I should proceed in these
>>> situations. Do I include the random intercept or not?
>>>
>>> I also have a related question. When I used to create a mixed-effects
>>> regression model using lmer and included e.g., an interaction in the
>>> fixed-effects structure, I would test if the inclusion of this
>>> interaction was warranted using anova(lmer1,lmer2). It then would show
>>> me that I invested 1 additional df and the resulting (possibly
>>> significant) improvement in fit of my model.
>>>
>>> This approach does not seem to work when using gam. In this case an
>>> apparent investment of 1 degree of freedom for the interaction, might
>>> result in an actual decrease of the degrees of freedom invested by the
>>> total model (caused by a decrease of the edf's of splines in the model
>>> with the interaction). In this case, how would I proceed in
>>> determining if the model including the interaction term is better?
>>>
>>> With kind regards,
>>> Martijn Wieling
>>>
>>> --
>>> *******************************************
>>> Martijn Wieling
>>> http://www.martijnwieling.nl
>>> [hidden email]
>>> +31(0)614108622
>>> *******************************************
>>> University of Groningen
>>> http://www.rug.nl/staff/m.b.wieling
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>> --
>> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
>> +44 (0)1225 386603 http://people.bath.ac.uk/sw283
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> ________________________________
>> If you reply to this email, your message will be added to the discussion
>> below:
>> http://r.789695.n4.nabble.com/mgcv-inclusion-of-random-intercept-in-model-based-on-p-value-of-smooth-or-anova-tp4617585p4631060.html
>> To unsubscribe from mgcv: inclusion of random intercept in model - based on
>> p-value of smooth or anova?, click here.
>> NAML
>
--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603 http://people.bath.ac.uk/sw283
More information about the R-help
mailing list