[R] Modifying a design matrix in mgcv

Wed Aug 14 21:32:30 CEST 2013

On Aug 14, 2013, at 11:08 AM, Andrew Crane-Droesch wrote:

> Hello,
> 
> I am trying to make a slight modification to the way gam (mgcv) works.  
> I want to modify the G$X matrix, which is the design matrix, to 
> accommodate an estimator that I am trying to program.  I am working with 
> panel data, and I want to take all continuous variables, including basis 
> functions evaluated at covariate variables, subtract their groupwise 
> means, and then add in their overall mean.  (This is an alternative to 
> having a dummy variable for every single group, and I'm working in a 
> context where random effects aren't appropriate).  I'd then do the same 
> thing with G$y.  Afterwards I want to pass the modified G$X and G$Y 
> matrices to estimate.gam, which takes G as an argument.
> 
> As far as I can tell, G$X is organized to have the parametric terms 
> (including factors) at the front with column names, and nonparametric 
> terms at the end without column names.  So identifying which terms to 
> apply the transformation to should be straightforward.
> 
> I've had some luck tweaking the internals of mgcv before, changing the 
> default plotting behavior of vis.gam.  But I am having trouble doing so 
> with gam itself.  To test things out, I did the following:
> 
>  * type "gam" into the terminal, to get the printout of the function's
>    code.
>  * add a line below line 70: G$X = G$X^2 (just to see if I can make an
>    arbitrary transformation)
>  * change the function name from gam to mod.gam, run it, and then run
>    mod.gam on an arbitrary dataset.
> 
> When I do so, I get the following error:
> Error in mod.gam(y ~ s(x)) : could not find function "variable.summary"
> 
> It turns out that variable.summary, along with gam.setup, estimate.gam, 
> and perhaps other constituents of gam do NOT have help pages, nor does 
> typing them into the terminal reveal their code.  Am I encountering 
> functions compiled from some non-R language ©?

You are encountering functions that are not exported from mgcv. So the environment where the interpreter looks for them is not the same as when `gam` is called. You could get them to "work" by prepending `mgcv::` to each instance of the "not-found" functions. You might also try just this:

environment(mod.gam) <- environment(gam)

I seem to get success with that strategy but I am not qualified to assess your coding chnages

> environment(mod.gam) <- environment(gam)
> modm = mod.gam(y~s(x))
> 
> modm

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x)

Estimated degrees of freedom:
8.2  total = 9.2 

GCV score: 1.375422

-- 
David.

>  gam.setup, estimate.gam
> 
> Regardless, I would greatly appreciate any help with my problem of 
> transforming the model matrix before fitting.  I'd very much prefer to 
> use mgcv, with all of the work that has obviously gone into it, rather 
> than reinventing the wheel.
> 
> My code for reproducing the error is below.
> 
> Thanks in advance for any help.
> 
> Best,
> Andrew
> 
> library(mgcv)
> gam
> mod.gam = function (formula, family = gaussian(), data = list(), weights 
> = NULL,
>     subset = NULL, na.action, offset = NULL, method = "GCV.Cp",
>     optimizer = c("outer", "newton"), control = list(), scale = 0,
>     select = FALSE, knots = NULL, sp = NULL, min.sp = NULL, H = NULL,
>     gamma = 1, fit = TRUE, paraPen = NULL, G = NULL, in.out = NULL,
>     ...)
> {
>     control <- do.call("gam.control", control)
>     if (is.null(G)) {
>         gp <- interpret.gam(formula)
>         cl <- match.call()
>         mf <- match.call(expand.dots = FALSE)
>         mf$formula <- gp$fake.formula
>         mf$family <- mf$control <- mf$scale <- mf$knots <- mf$sp <- 
> mf$min.sp <- mf$H <- mf$select <- mf$gamma <- mf$method <- mf$fit <- 
> mf$paraPen <- mf$G <- mf$optimizer <- mf$in.out <- mf$... <- NULL
>         mf$drop.unused.levels <- TRUE
>         mf[[1]] <- as.name("model.frame")
>         pmf <- mf
>         mf <- eval(mf, parent.frame())
>         if (nrow(mf) < 2)
>             stop("Not enough (non-NA) data to do anything meaningful")
>         terms <- attr(mf, "terms")
>         vars <- all.vars(gp$fake.formula[-2])
>         inp <- parse(text = paste("list(", paste(vars, collapse = ","),
>             ")"))
>         if (!is.list(data) && !is.data.frame(data))
>             data <- as.data.frame(data)
>         dl <- eval(inp, data, parent.frame())
>         names(dl) <- vars
>         var.summary <- variable.summary(gp$pf, dl, nrow(mf))
>         rm(dl)
>         pmf$formula <- gp$pf
>         pmf <- eval(pmf, parent.frame())
>         pterms <- attr(pmf, "terms")
>         if (is.character(family))
>             family <- eval(parse(text = family))
>         if (is.function(family))
>             family <- family()
>         if (is.null(family$family))
>             stop("family not recognized")
>         if (family$family[1] == "gaussian" && family$link ==
>             "identity")
>             am <- TRUE
>         else am <- FALSE
>         if (!control$keepData)
>             rm(data)
>         G <- gam.setup(gp, pterms = pterms, data = mf, knots = knots,
>             sp = sp, min.sp = min.sp, H = H, absorb.cons = TRUE,
>             sparse.cons = 0, select = select, idLinksBases = 
> control$idLinksBases,
>             scale.penalty = control$scalePenalty, paraPen = paraPen)
>         G$var.summary <- var.summary
>         G$family <- family
>         if (ncol(G$X) > nrow(G$X) + nrow(G$C))
>             stop("Model has more coefficients than data")
>         G$terms <- terms
>         G$pterms <- pterms
>         G$mf <- mf
>         G$cl <- cl
>         G$am <- am
>         if (is.null(G$offset))
>             G$offset <- rep(0, G$n)
>         G$min.edf <- G$nsdf - dim(G$C)[1]
>         if (G$m)
>             for (i in 1:G$m) G$min.edf <- G$min.edf + 
> G$smooth[[i]]$null.space.dim
>         G$formula <- formula
>         environment(G$formula) <- environment(formula)
>     }
>     if (!fit)
>         return(G)
>     G$conv.tol <- control$mgcv.tol
>     G$max.half <- control$mgcv.half
>     G$X = G$X^2
>     object <- estimate.gam(G, method, optimizer, control, in.out,
>         scale, gamma, ...)
>     if (!is.null(G$L)) {
>         object$full.sp <- as.numeric(exp(G$L %*% log(object$sp) +
>             G$lsp0))
>         names(object$full.sp) <- names(G$lsp0)
>     }
>     names(object$sp) <- names(G$sp)
>     object$paraPen <- G$pP
>     object$formula <- G$formula
>     object$var.summary <- G$var.summary
>     object$cmX <- G$cmX
>     object$model <- G$mf
>     object$na.action <- attr(G$mf, "na.action")
>     object$control <- control
>     object$terms <- G$terms
>     object$pterms <- G$pterms
>     object$assign <- G$assign
>     object$contrasts <- G$contrasts
>     object$xlevels <- G$xlevels
>     object$offset <- G$offset
>     if (!is.null(G$Xcentre))
>         object$Xcentre <- G$Xcentre
>     if (control$keepData)
>         object$data <- data
>     object$df.residual <- nrow(G$X) - sum(object$edf)
>     object$min.edf <- G$min.edf
>     if (G$am && !(method %in% c("REML", "ML", "P-ML", "P-REML")))
>         object$optimizer <- "magic"
>     else object$optimizer <- optimizer
>     object$call <- G$cl
>     class(object) <- c("gam", "glm", "lm")
>     object
> }
> x = rnorm(100)
> y = exp(y)+rnorm(100)
> m = gam(y~s(x))
> modm = mod.gam(y~s(x))
> 
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA