[R-sig-ME] gamm4 error with large dataset

Wed Apr 30 18:23:30 CEST 2014

On 14-04-30 12:03 PM, Daniel Hocking wrote:
> I am trying to predict daily water temperature from air temperature
> primarily but ideally would include other factors such as
> precipitation and landscape characteristics. I have paired air and
> water temperatures from 600+ sites over a ~10 year period. Some sites
> have daily temperatures for just a couple months and others for
> years, and some for a couple months sporadically in different years.
> I am trying to use a mixed effects gamm so I can include random
> effects of site and year and smooth over day of the year (dOY). My
> dataframe is
> 
> and I get the following error when I run this code
> 
> system.time(gamm4Full <- gamm4(temp ~ airTemp + airTempLagged1 +
> airTempLagged2 + prcp + prcpLagged1 + prcpLagged2 + Latitude +
> Longitude + Forest + Agriculture + BasinElevationM + ReachSlopePCNT +
> CONUSWetland + SurficialCoarseC + s(dOY) + prcp*airTemp, random =
> ~(1| site) + (1 | year), data = etS))
> 
> # Error in crossprod(root.phi %*% Zt) : # Cholmod error 'problem too
> large' at file ../Core/cholmod_aat.c, line 173 # In addition: Warning
> message: #   In optwrap(optimizer, devfun, getStart(start, rho$lower,
> rho$pp),  : #               convergence code 1 from bobyqa: bobyqa --
> maximum number of function evaluations exceeded
> 
> 
> My plan was to try gamm4 and if there was autocorrelation issues to
> switch to gamm within mgcv. I know bam is designed for large data but
> I’m not sure how I would code the random effects using bam. I know in
> general it’s s(dOY, bs = ‘re’) but I’m not sure how to relate this to
> site and year. Ideally I would have random slopes for airTemp effects
> for each site because of things like ground water inputs that we
> don’t measure.
> 

  I can imagine that this problem is caused by the size of the
fixed-effect matrix.  A couple of thoughts (none of them practical, I'm
afraid):

  * I was going to say that it's too bad that we haven't yet managed to
implement a sparse model matrix structure;
  * then I was going to say that a potential trick/workaround for this
(for many-level _categorical_ variables) is to treat the factor as a
random effect, then use devFunOnly/modular structure to fix the theta
parameter for that variable at a large value, making it a pseudo-fixed
effect and getting the benefits of (1) a little bit of regularization
and (2) model matrix sparsity -- but doing this within gamm4 would be
harder/require more hacking
  * then I realized that your fixed-effect model matrix probably isn't
sparse, because it looks like it's made up entirely of continuous covariates
  * that got me thinking about the fact that some of your continuous
covariates only vary at higher levels (i.e. Lat/Long and presumably
Forest, Agriculture, elevation, etc.), and wondering whether there would
be any way to save space by going back to the underlying model
formulation and writing this out in terms of another multiplication of
higher-level covariates times an indicator matrix ...

  ... all of which is fascinating (to me at least) but none of which
actually gets you any farther with your specific problem.  Sorry.

  Ben Bolker