[R] Why do two different calls to mgcv::gamm give different p-values?

Tue Oct 2 14:26:46 CEST 2018

Dear Øystein,

In your code 'id' is set up to be numeric. In your first gamm call it 
gets coerced to a factor (since nothing else would make sense). In  your 
second gamm call it is simply treated as numeric, since that could makes 
sense. To make the two models equivalent you just need to substitute:

id <- as.factor(rep(1:ng, n/ng))

into your loop.

best,
Simon

On 01/10/18 08:15, Øystein Sørensen wrote:
> I have longitudinal data where a response y_{ij} has been measured
> repeatedly for i=1,...,n individuals, j=1,...,m times for each individual.
> Let x_{ij} be the age of individual i at her/his jth measurement. I wish to
> see how the response varies with age, and have reason to believe that a
> nonlinear fit is needed. I hence wish to model the relationship using an
> additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
> \epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
> \sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
> \epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.
>
> Reading the documentation to the mgcv package, I see that such models can
> be set up by calling the mgcv::gamm two different ways. One way shown to
> set up such a model is with the call
> b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
> where id is an indicator of the specific individual. In this way, the
> random effect is specified in a list. The other way is to set up the random
> effect with a smooth of the re:
> b2 <- gamm(y ~ s(x) + s(id, bs = "re")).
>
> As far as I can understand, these two setups should create the same
> underlying model matrices. However, when running them on my data, the
> p-values for the smooth terms, as well as their estimated degrees of
> freedom, are different.
>
> Below is a reproducible example on simulated data, which show that these
> two types of specifying the models give different p-values and estimated
> degrees of freedom. On my real data, these differences are sometimes even
> more exaggerated.
>
> My question is: Are not these two calls to gamm suppose to estimate the
> same model, or have I misunderstood?
>
> Here is a reproducible example:
>
> library(mgcv)
> set.seed(100)
> n_sim <- 100
> df <- data.frame(
>    model1_pval = numeric(n_sim),
>    model1_edf = numeric(n_sim),
>    model2_pval = numeric(n_sim),
>    model2_edf = numeric(n_sim)
> )
>
> # Number of observations
> n <- 500
> # Number of groups
> ng <- 100
>
> for(i in 1:n_sim){
>    # Draw the fixed effect covariate
>    x <- rnorm(n)
>    # Set up the group
>    id <- rep(1:ng, n/ng)
>    # Draw the random effect
>    z <- rnorm(ng)
>    # Draw the response
>    y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)
>
>    # Fit the two different models
>    b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
>    b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))
>
>    df[i, "model1_pval"] <- anova(b1$gam)$s.pv
>    df[i, "model1_edf"] <- anova(b1$gam)$edf
>    df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
>    df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
> }
>
> # Differences between model 1 and model 2 p values
> mean(df$model1_pval)
> #> [1] 6.790546e-21
> mean(df$model2_pval)
> #> [1] 3.090694e-14
> max(abs(df$model1_pval - df$model2_pval))
> #> [1] 2.770545e-12
>
> # Differences between model1 and model 2 estimated degrees of freedom
> mean(df$model1_edf)
> #> [1] 3.829992
> mean(df$model2_edf)
> #> [1] 3.731438
> max(abs(df$model1_edf - df$model2_edf))
> #> [1] 0.6320281
>
>
> Thanks in advance,
> Øystein Sørensen
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.