[R] Why do two different calls to mgcv::gamm give different p-values?
Øystein Sørensen
oy@tein@@oren@en@1985 @ending from gm@il@com
Mon Oct 1 08:15:06 CEST 2018
I have longitudinal data where a response y_{ij} has been measured
repeatedly for i=1,...,n individuals, j=1,...,m times for each individual.
Let x_{ij} be the age of individual i at her/his jth measurement. I wish to
see how the response varies with age, and have reason to believe that a
nonlinear fit is needed. I hence wish to model the relationship using an
additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
\epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
\sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
\epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.
Reading the documentation to the mgcv package, I see that such models can
be set up by calling the mgcv::gamm two different ways. One way shown to
set up such a model is with the call
b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
where id is an indicator of the specific individual. In this way, the
random effect is specified in a list. The other way is to set up the random
effect with a smooth of the re:
b2 <- gamm(y ~ s(x) + s(id, bs = "re")).
As far as I can understand, these two setups should create the same
underlying model matrices. However, when running them on my data, the
p-values for the smooth terms, as well as their estimated degrees of
freedom, are different.
Below is a reproducible example on simulated data, which show that these
two types of specifying the models give different p-values and estimated
degrees of freedom. On my real data, these differences are sometimes even
more exaggerated.
My question is: Are not these two calls to gamm suppose to estimate the
same model, or have I misunderstood?
Here is a reproducible example:
library(mgcv)
set.seed(100)
n_sim <- 100
df <- data.frame(
model1_pval = numeric(n_sim),
model1_edf = numeric(n_sim),
model2_pval = numeric(n_sim),
model2_edf = numeric(n_sim)
)
# Number of observations
n <- 500
# Number of groups
ng <- 100
for(i in 1:n_sim){
# Draw the fixed effect covariate
x <- rnorm(n)
# Set up the group
id <- rep(1:ng, n/ng)
# Draw the random effect
z <- rnorm(ng)
# Draw the response
y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)
# Fit the two different models
b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))
df[i, "model1_pval"] <- anova(b1$gam)$s.pv
df[i, "model1_edf"] <- anova(b1$gam)$edf
df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
}
# Differences between model 1 and model 2 p values
mean(df$model1_pval)
#> [1] 6.790546e-21
mean(df$model2_pval)
#> [1] 3.090694e-14
max(abs(df$model1_pval - df$model2_pval))
#> [1] 2.770545e-12
# Differences between model1 and model 2 estimated degrees of freedom
mean(df$model1_edf)
#> [1] 3.829992
mean(df$model2_edf)
#> [1] 3.731438
max(abs(df$model1_edf - df$model2_edf))
#> [1] 0.6320281
Thanks in advance,
Øystein Sørensen
[[alternative HTML version deleted]]
More information about the R-help
mailing list