[R] mgcv gam bs=re questions

Sat May 3 21:51:33 CEST 2014

Dear R-helpers,

I am working on a project assessing the prevalence and variance (random 
effects) of linear and nonlinear trends in a data set that has short 
time series (each time series identified as PID 1 through 5). I am using 
mgcv (gam) with the bs=”re” option (more on why not gamm or gamm4 at the 
end). I want to know whether trend is linear or nonlinear in each time 
series, and whether trend varies significantly over time series (e.g., 
PID 1 through 5 below)--e.g., whether they have significantly different 
edfs. The analysis predicts a count outcome (DVY) from time (SessIDX) 
and treatment (TX), using a quasipoisson approach to correct standard 
errors for overdispersion. The dataset in the examples below (SID5DVID1) 
has 5 time series (PID 1 through 5) with 93, 97, 70, 86, and 103 time 
points each, respectively. E.g.,

JID SID PID DVID SessIDX DVY TX
1     1   5   1    1       1  23  0
2     1   5   1    1       2  20  0
3     1   5   1    1       3  30  0
4     1   5   1    1       4  40  0

…

434   1   5   5    1     100   5  1
435   1   5   5    1     101   3  1
436   1   5   5    1     102  12  1
437   1   5   5    1     103   7  1
 >

My question concerns how to interpret the random effects that result 
from variations in the model. A model with smoothed trend and no random 
effect yields

M1 <- gam(DVY  ~ s(SessIDX) + factor(TX),
+            data = SID5DVID1,
+            family = quasipoisson(link="log"), method="REML")

Approximate significance of smooth terms:
              edf Ref.df     F  p-value
s(SessIDX) 1.876  2.403 11.52 4.34e-06 ***

So trend over all five time series is linear and significant. If I allow 
trend to interact with PID (be different for each time series), without 
a random effect, I get

M1a <- gam(DVY  ~ s(SessIDX, by = factor(PID)) + factor(TX),
            data = SID5DVID1,
            family = quasipoisson(link="log"), method="REML")
summary(M1a,all.p=TRUE)

Approximate significance of smooth terms:
                           edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 1.001  1.001 31.827 2.98e-08 ***
s(SessIDX):factor(PID)2 1.000  1.000  2.658 0.103724
s(SessIDX):factor(PID)3 5.100  5.886  8.674 9.42e-09 ***
s(SessIDX):factor(PID)4 1.337  1.602  9.822 0.000366 ***
s(SessIDX):factor(PID)5 3.998  4.924  7.967 4.40e-07 ***

It appears trend may be quite different for each time series. So now I 
want to test whether trend varies significantly over time series using 
bs=”re”. One model is

M2 <- gam(DVY  ~ s(SessIDX, bs = "re") + factor(TX),
            data = SID5DVID1,
            family = quasipoisson(link="log"), method="REML")
summary(M2,all.p=TRUE)
gam.vcomp(M2)

Approximate significance of smooth terms:
               edf Ref.df     F  p-value
s(SessIDX) 0.9609      1 24.48 6.53e-07 ***
…
Standard deviations and 0.95 confidence intervals:
               std.dev      lower      upper
s(SessIDX) 0.01470809 0.00347556 0.06224258
scale      3.20933040 3.00277692 3.43009218

Rank: 2/2
 >

So the random effect std.dev is 0.01470809, and it is significant with 
p-value = 6.53e-07. I had originally thought that this meant that trend 
varied significantly over the five time series. However, after having 
read a recent article (Gary J. McKeown and Ian Sneddon, DOI: 
10.1037/a0034282), I am not sure I am correctly interpreting this. In 
their article, this approach (a) changes the shape of the nonlinear 
trend only slightly compared to not using random effects [i.e., just 
using s(SessIDX)] but still produces only one trend line for all time 
series, and (b) increases the width of the confidence bands around the 
trend line (which of course makes perfect sense). In other words, it 
does not obviously allow each time series (PID) to have a quite 
different edf or trend line. So, parallel with M1a above, I can run:

M5a <- gam(DVY  ~ s(SessIDX, by = factor(PID), bs="re") + factor(TX),
            data = SID5DVID1,
            family = quasipoisson(link="log"), method="REML")
summary(M5a,all.p=TRUE)
gam.vcomp(M5a)

Approximate significance of smooth terms:
                            edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 0.9239      1  38.76 0.000368 ***
s(SessIDX):factor(PID)2 0.9892      1 107.00  < 2e-16 ***
s(SessIDX):factor(PID)3 0.9867      1  94.85  < 2e-16 ***
s(SessIDX):factor(PID)4 0.9830      1 108.33 1.64e-13 ***
s(SessIDX):factor(PID)5 0.9658      1  73.41 1.32e-07 ***

Standard deviations and 0.95 confidence intervals:

                             std.dev       lower      upper
s(SessIDX):factor(PID)1 0.008853821 0.001960054 0.03999387
s(SessIDX):factor(PID)2 0.065276527 0.016064261 0.26524874
s(SessIDX):factor(PID)3 0.048967099 0.012003711 0.19975296
s(SessIDX):factor(PID)4 0.027830597 0.006777380 0.11428341
s(SessIDX):factor(PID)5 0.012218325 0.002893741 0.05158977
scale                   2.559988984 2.394424050 2.73700208

This seems to produce five random effects, all apparently significant. 
But it is not clear to me how to interpret these random effects. For 
example, consider s(SessIDX):factor(PID)1 0.008853821 above. Clearly 
this is not what I want—it is not a measure of whether the five cases 
differ significantly from each other in trend. I am guessing this random 
effect concerns whether time points within PID1 differ significantly 
from each other?

One might, of course, use gamm or gamm4 to address the random effects 
question. I had hoped, however, to run all analyses within the same 
program (gam) to avoid confounding differences in results with 
differences in the estimation routines that different programs use.

Running it all in gam, however, encounters another problem, that it does 
not appear that gam can use bs=”re” while fixing the trend to be linear. 
For example:

 > M4 <- gam(DVY  ~ s(SessIDX, bs = "re", k = 2, fx=TRUE) +
+                factor(TX),
+            data = SID5DVID1,
+            family = quasipoisson(link="log"), method="REML")
 > summary(M4,all.p=TRUE)
Error in b$smooth[[m]]$S[[1]] : subscript out of bounds
 > gam.vcomp(M4)
Error in if (x$smooth[[i]]$sp[j] < 0) { : argument is of length zero
 >

I have read extensively to try to understand all this (e.g., Wood 2013 
“A simple test for random effects in regression models”) but have not 
succeeded. Any help or advice would be appreciated very very much.

Will Shadish

-- 
William R. Shadish
Distinguished Professor
Founding Faculty

Mailing Address:
William R. Shadish
University of California
School of Social Sciences, Humanities and Arts
5200 North Lake Rd
Merced CA  95343

Physical/Delivery Address:
University of California Merced
ATTN: William Shadish
School of Social Sciences, Humanities and Arts
Facilities Services Building A
5200 North Lake Rd.
Merced, CA 95343

209-228-4372 voice
209-228-4007 fax (communal fax: be sure to include cover sheet)
wshadish at ucmerced.edu
http://faculty.ucmerced.edu/wshadish/index.htm
http://psychology.ucmerced.edu