[R] Modifying the internals of mgcv's G object to program a within estimator

Thu Aug 15 19:03:56 CEST 2013

I'm trying to write a function to make gam (in mgcv) give me the 
fixed-effects "within" estimator, which is equivalent to the OLS dummy 
variable estimator.  Basically this involves subtracting the 
within-group means from the IVs and the DV, adding their overall means 
back in, and fitting the model.  For OLS, this is equivalent to running 
a regression with a dummy variable for each group.  With many groups, 
this gets computationally prohibitive, especially if you want to 
bootstrap.  Eventually I want to apply this procedure to all of gam's 
splined terms, but for now I'm simply trying to modify gam's G object to 
do so only for one parametrically-represented variable.

The G object feeds everything to gam needed to fit the model.  The code 
below modifies the design matrix, the DV, and the formula.  My 
question:  what else needs to change in G to make the FE estimator 
equivalent to the dummy variable estimator?  Unfortunately I can't find 
any technical information or well-commented code describing how gam 
works in this regard, nor good descriptions of what the constituents of 
G are.  Nor can I look at the gam.setup function -- I think that it is 
compiled in C (or something).  Responses that can make the estimate of 
the coefficient on x (below) equal for the two models would be really 
welcome, as would pointers to useful documentation.

rm(list=ls())

library(mgcv)
library(plyr)
x = rexp(100)
ID = ceiling(runif(100)*10)
y = exp(x)+rnorm(100)+ID
t = ceiling(runif(100)*5)
data = data.frame(y,x,ID,t)
data = data[with(data,order(ID,t)),]

m = gam(y~x+as.factor(ID),fit=FALSE)

transform = function(G,ID){
     rep.row = function(x,n){matrix(rep(x,each=n),nrow=n)}
     X = G$X[,grepl(colnames(data.frame(ID=ID)),colnames(G$X))==FALSE]
     dmframe = as.data.frame(cbind(ID,X))
     dm <- as.matrix(ddply(dmframe, "ID", colwise(scale, center = TRUE, 
scale = FALSE)))
     cm = rep.row(colMeans(dmframe[,2:ncol(dmframe)]),nrow(dmframe))
     X = dm[,2:ncol(dm)]+cm
     G$X = X

     y = G$y
     dmframe = as.data.frame(cbind(ID,y))
     dm <- ddply(dmframe, "ID", colwise(scale, center = TRUE, scale = 
FALSE))
     dm = as.matrix(dm)
     cm = mean(dmframe[,2])
     y = dm[,2]+cm
     G$y = y

     spec = as.character(G$formula)[3]; n=nchar(spec)
     G$formula = 
as.formula(paste(as.character(G$formula)[2],as.character(G$formula)[1],substr(spec,start=1,stop=n-15)))
     }

#m = gam(y~s(x,k=7)+as.factor(ID),fit=FALSE,sp=0)
out = transform(m,data$ID)
G=m; ID=data$ID

mfe = gam(out)
summary(mfe)
mlsdv = gam(y~x+as.factor(ID))
#mlsdv = gam(y~s(x,k=7)+as.factor(ID),sp=0)
summary(mlsdv)

#The coefficient on x should be equal for the two models. The intercept 
should represent the average of the fixed effects.
cbind(as.matrix(t(mlsdv$coef))[,grepl(colnames(data.frame(ID=ID)),colnames(as.matrix(t(mlsdv$coef))))==FALSE], 
mfe$coef)