[R] gam (mgcv) vs. multiple regression breakpoint analysis: inconsistencies?

Wed May 23 19:09:58 CEST 2012

Dear useRs,

I have a question with respect to fitting a non-linearity using gam
(mgcv package, version 1.7-16).

In a study I'm currently conducting, I'd like to find out if there is
a breakpoint after which the effect of Age of Acquisition (AOA) of the
second language changes. I.e. if the slope of AOA before the
breakpoint (at a certain AOA) is different from the slope past the
breakpoint. For this purpose, I can use the breakpoint analysis
proposed by Baayen (2008) using lm or lmer. Using that approach I
clearly find a breakpoint which improves the fit of the model
significantly over the simpler model (on the basis of model
comparison).

Given that I can model a non-linearity in a gam (from mgcv) using
s(...), I'd prefer to show the presence of a breakpoint (i.e. a
non-linearity) by using a smooth. Instead of specifying an explicit
breakpoint, I would then include s(AOA). I'd expect if a breakpoint
would be found using multiple regression, this would be detected by
the smooth as well.

When I create a simple gam (model g1, below), I do find a
non-linearity consistent with the breakpoint analysis. However, when I
include an additional covariate (in model g2, below), the
non-linearity disappears (this is inconsistent with the breakpoint
analysis including the covariate). However, if I include the
non-linearity estimated in g1 in addition to the covariate (model g3,
below), this improves the fit over g2 (which did not detect a
non-linearity).

Please note that the actual analysis (not shown) contains two
breakpoints in addition to a random-effect factor. Also breakpoints
are highly significant improvements (p = 0.0004) over the models
without breakpoints (the example below has less convincing p-values,
but exhibits the same pattern).

I'd really appreciate it if someone could help shed light on these
findings... Perhaps I'm doing something wrong, but the results go
against my intuition.

With kind regards,
Martijn Wieling,
University of Groningen

### breakpoint analysis using lm, both with and without covariate
# note that AoEO indicates the AOA
# LR denotes Length of Residence

deviances1 = rep(Inf, 20)
deviances2 = rep(Inf, 20)
for (breakpoint in 6:20) {
	spk$ShiftedAoEO = spk$AoEO - breakpoint;
	spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0);
	m1 = lm(LDlog ~ ShiftedAoEO:PastBreakPoint,data=spk)
	m2 = lm(LDlog ~ LR + ShiftedAoEO:PastBreakPoint,data=spk)
	deviances1[breakpoint] = deviance(m1)
	deviances2[breakpoint] = deviance(m2)
}

breakpoint = which(deviances1 == min(deviances1)) # 18
spk$ShiftedAoEO = spk$AoEO - breakpoint;
spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0);
m1 = lm(LDlog ~ ShiftedAoEO:PastBreakPoint,data=spk)
m0 = lm(LDlog ~ AoEO,data=spk)
anova(m0,m1) # m1 is better (p = 0.013)

breakpoint = which(deviances2 == min(deviances2)) # 19
spk$ShiftedAoEO = spk$AoEO - breakpoint;
spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0);
m2 = lm(LDlog ~ LR + ShiftedAoEO:PastBreakPoint,data=spk)
m0 = lm(LDlog ~ LR + AoEO,data=spk)
anova(m0,m2) # m2 is better (p = 0.033)

summary(m1) # the line is steep before the breakpoint (18), flatter after
#ShiftedAoEO:PastBreakPointFALSE  0.014795   0.001662    8.900  < 2e-16 ***
#ShiftedAoEO:PastBreakPointTRUE   0.007713   0.001723    4.477 8.65e-06 ***

summary(m2) # the line is steep before the breakpoint (19), flatter after
#ShiftedAoEO:PastBreakPointFALSE  0.0139418  0.0016260    8.575  < 2e-16 ***
#ShiftedAoEO:PastBreakPointTRUE   0.0076672  0.0018330    4.183  3.2e-05 ***
### end of breakpoint analysis

### alternative analysis using gam
library(mgcv) # 1.7-16
g1 = gam(LDlog ~ s(AoEO), data=spk, method="ML")
plot(g1) # steep before AoEO ~ 18, after that flatter
summary(g1)
#        edf Ref.df   F    p-value
#s(AoEO) 1.7  2.126 73.17  <2e-16 ***

g0 = gam(LDlog ~ AoEO, data=spk, method="ML")
anova(g0,g1,test="F") # g1 is better (p = 0.024)

g2 = gam(LDlog ~ LR + s(AoEO), data=spk, method="ML") # LR improves the fit
plot(g2) # linear relationship, no breakpoint
summary(g2) # R-sq.(adj) = 0.167
#             Estimate Std. Error  t value Pr(>|t|)
#LR          -0.0018097  0.0005808   -3.116   0.0019 **
#
#        edf Ref.df   F  p-value
#s(AoEO)   1      1 145  <2e-16 ***

# use non-linearity from g1
pred = predict(g1, type="terms") # use predicted non-linear effect of g1
spk$AoEO.s = pred[,"s(AoEO)"]
g3 = gam(LDlog ~ LR + AoEO.s, data=spk, method="ML")
summary(g3) # both LR and AoEO.s significant, R-sq.(adj) = 0.171

# manual comparison of g2 and g3 (df of g3 = df of g2 + 0.7)
pf(4.778,0.7,802.3,lower.tail=F) # p = 0.041 => g3 is better than g2
####