[R-sig-ME] gamm4 error with large dataset

Thu May 1 20:45:38 CEST 2014

Ben Bolker <bbolker at ...> writes:

> 
> On 14-04-30 12:03 PM, Daniel Hocking wrote:
> > I am trying to predict daily water temperature from air temperature
> > primarily but ideally would include other factors such as
> > precipitation and landscape characteristics. I have paired air and
> > water temperatures from 600+ sites over a ~10 year period. Some sites
> > have daily temperatures for just a couple months and others for
> > years, and some for a couple months sporadically in different years.
> > I am trying to use a mixed effects gamm so I can include random
> > effects of site and year and smooth over day of the year (dOY).

  [snip to make Gmane happy]

> > 
> > 
> > My plan was to try gamm4 and, if there were autocorrelation issues, to
> > switch to gamm within mgcv. I know bam is designed for large data but
> > I’m not sure how I would code the random effects using bam. I know in
> > general it’s s(dOY, bs = "re") but I’m not sure how to relate this to
> > site and year. Ideally I would have random slopes for airTemp effects
> > for each site because of things like ground water inputs that we
> > don’t measure.
> > 
> 
>   I can imagine that this problem is caused by the size of the
> fixed-effect matrix.  A couple of thoughts (none of them practical, I'm
> afraid):
> 
>   * I was going to say that it's too bad that we haven't yet managed to
> implement a sparse model matrix structure;
>   * then I was going to say that a potential trick/workaround for this
> (for many-level _categorical_ variables) is to treat the factor as a
> random effect, then use devFunOnly/modular structure to fix the theta
> parameter for that variable at a large value, making it a pseudo-fixed
> effect and getting the benefits of (1) a little bit of regularization
> and (2) model matrix sparsity -- but doing this within gamm4 would be
> harder/require more hacking
>   * then I realized that your fixed-effect model matrix probably isn't
> sparse, because it looks like it's made up entirely of continuous covariates
>   * that got me thinking about the fact that some of your continuous
> covariates only vary at higher levels (i.e. Lat/Long and presumably
> Forest, Agriculture, elevation, etc.), and wondering whether there would
> be any way to save space by going back to the underlying model
> formulation and writing this out in terms of another multiplication of
> higher-level covariates times an indicator matrix ...
> 
>   ... all of which is fascinating (to me at least) but none of which
> actually gets you any farther with your specific problem.  Sorry.
> 
>   Ben Bolker
> 

  A couple of further thoughts:

 * if it's true that the size of the fixed-effects model matrix
is your problem, then bam may not help you either (I would guess,
without bothering to check, that it's intended to reduce the size/
increase the sparsity of the spline bases, rather than messing
with the fixed-effect covariates).  
 * you can check whether the fixed-effects matrix is the problem
by fitting the model as a straight linear mixed model, i.e. with
lmer, either without dOY or with something simple like ns(dOY,10);
you could also try Doug Bates's MixedModels package for Julia.
 * if you were willing to make year a fixed effect, you would have
only a single RE, and you might then be able to take advantage of
some model structures for increased efficiency (at least until
you incorporate interactions of site with various covariates).
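
  For the lmer check suggested above, a minimal sketch might look like
the following. The data here are simulated stand-ins, and the column
names (waterTemp, airTemp, dOY, site, year) are only guesses at what the
real data frame contains; the point is just to look at the size of the
dense fixed-effect model matrix and to fit the plain LMM without gamm4.

```r
## Sketch only: simulated stand-in data; all column names are hypothetical.
library(lme4)
library(splines)

set.seed(1)
n_site <- 20; n_year <- 5; n_day <- 30
d <- expand.grid(site = factor(seq_len(n_site)),
                 year = factor(2001:2005),
                 dOY  = seq(1, 361, length.out = n_day))
d$airTemp <- 10 + 10 * sin(2 * pi * (d$dOY - 100) / 365) +
    rnorm(nrow(d), sd = 2)
site_eff <- rnorm(n_site, sd = 1)    # one intercept shift per site
year_eff <- rnorm(n_year, sd = 0.5)  # one intercept shift per year
d$waterTemp <- 5 + 0.6 * d$airTemp + site_eff[d$site] +
    year_eff[d$year] + rnorm(nrow(d), sd = 1)

## Size of the dense fixed-effect model matrix on its own
X <- model.matrix(~ airTemp + ns(dOY, 10), data = d)
dim(X)
print(object.size(X), units = "Mb")

## Plain LMM with a simple spline for day of year, no gamm4 involved
fit <- lmer(waterTemp ~ airTemp + ns(dOY, 10) + (1 | site) + (1 | year),
            data = d)
```

  If the same model fitted to the full data set fails in the same way,
that would suggest the fixed-effect matrix (rather than anything
specific to gamm4) is the bottleneck; if it fits fine, the problem is
more likely on the smoothing side.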