[R-sig-ME] gamm4 error with large dataset

Ben Bolker bbolker at gmail.com
Thu May 1 20:45:38 CEST 2014

Ben Bolker <bbolker at ...> writes:

> On 14-04-30 12:03 PM, Daniel Hocking wrote:
> > I am trying to predict daily water temperature from air temperature
> > primarily but ideally would include other factors such as
> > precipitation and landscape characteristics. I have paired air and
> > water temperatures from 600+ sites over a ~10 year period. Some sites
> > have daily temperatures for just a couple months and others for
> > years, and some for a couple months sporadically in different years.
> > I am trying to use a mixed effects gamm so I can include random
> > effects of site and year and smooth over day of the year (dOY).

  [snip to make Gmane happy]

> > 
> > 
> > My plan was to try gamm4 and if there was autocorrelation issues to
> > switch to gamm within mgcv. I know bam is designed for large data but
> > I’m not sure how I would code the random effects using bam. I know in
> > general it’s s(dOY, bs = ‘re’) but I’m not sure how to relate this to
> > site and year. Ideally I would have random slopes for airTemp effects
> > for each site because of things like ground water inputs that we
> > don’t measure.
> > 
>   I can imagine that this problem is caused by the size of the
> fixed-effect matrix.  A couple of thoughts (none of them practical, I'm
> afraid):
>   * I was going to say that it's too bad that we haven't yet managed to
> implement a sparse model matrix structure;
>   * then I was going to say that a potential trick/workaround for this
> (for many-level _categorical_ variables) is to treat the factor as a
> random effect, then use devFunOnly/modular structure to fix the theta
> parameter for that variable at a large value, making it a pseudo-fixed
> effect and getting the benefits of (1) a little bit of regularization
> and (2) model matrix sparsity -- but doing this within gamm4 would be
> harder/require more hacking
>   * then I realized that your fixed-effect model matrix probably isn't
> sparse, because it looks like it's made up entirely of continuous covariates
>   * that got me thinking about the fact that some of your continuous
> covariates only vary at higher levels (i.e. Lat/Long and presumably
> Forest, Agriculture, elevation, etc.), and wondering whether there would
> be any way to save space by going back to the underlying model
> formulation and writing this out in terms of another multiplication of
> higher-level covariates times an indicator matrix ...
>   ... all of which is fascinating (to me at least) but none of which
> actually gets you any farther with your specific problem.  Sorry.
>   Ben Bolker
  A couple of further thoughts:

 * if it's true that the size of the fixed-effects model matrix
is your problem, then bam may not help you either (I would guess,
without bothering to check, that it's intended to reduce the size/
increase the sparsity of the spline bases, rather than messing
with the fixed-effect covariates).  
 * you can check whether the fixed-effects matrix is the problem
by running the problem as a straight linear mixed model, i.e. with
lmer either without dOY or with something simple like ns(dOY,10);
you could also try Doug Bates's MixedModels package from Julia.
 * If you were willing to make year a fixed effect you would only
have a single RE, then you might be able to take advantage of
some model structures for increased efficiency (at least until
you incorporate interactions of site with various covariates)

More information about the R-sig-mixed-models mailing list