[R-sig-ME] GAMM big data (70K rand effects) guidance

Fri May 8 09:44:01 CEST 2015

Dear Steven,

It looks like you have on average about 7 measurements per patient. Are
they taken at somewhat fixed timepoints? E.g. something like ayfu = seq(0,
by = 50, length = 7) + rnorm(7, sd = 2). In that case you could consider to
make the model less complex and treat ayfu as categorical. You won't loose
that much information. Then the model becomes lmer(log(cd4 + 1) ~sex +
timepoint * CD4 + (1|PatientID)). I recommend renaming CD4. Having CD4 and
cd4 in the samen dataset is confusing.

Another thing that I would try is to limit to complexity of the smoother be
setting k = 3.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2015-05-07 17:31 GMT+02:00 Steve Bellan <steve.bellan op gmail.com>:

> Hi all,
>
> I am working with an patient data base of 70K HIV-infected individuals
> followed over time since treatment initiation, with 500K total observations
> that include a laboratory measurement (CD4 cell count—an indicator of
> immunocompetence). I’m trying to use GAMM to model the CD4 trajectory as a
> function of CD4 at treatment initiation (i.e. y-intercept) and other
> covariate classes (sex, age, etc). Thus, far I’ve struggled to fit GAMMs to
> the entire data set.
>
> I’m using a gaussian link function to log(CD4+1) for now. With gamm, this
> gives the following:
>
> > form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init,
> bs=“tp")')
> > print(system.time(tg1 <- gamm(form, data = nd, order.groups=F,
> family=gaussian, random=list(PatientID=~1))))
>
> where ayfu is time since treatment initiation and CD4_cat_init is the CD4
> count at treatment initiation broken into 5 categories.
>
> I ran that on a large memory (1TB) node on our HPC cluster and, after 12
> hours using between 300-500 GB of memory, it crashed:
>
> > Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups =
> F,  :
> >   error in evaluating the argument 'x' in selecting a method for
> function 'print': Error in cbind(X1, X[[i]][, j] * X0) :
> >   long vectors not supported yet: bind.c:1301
> > Calls: system.time ... extract.lme.cov2 -> cbind ->
> tensor.prod.model.matrix -> cbind
>
> Google tells me that this has to do with limits on R’s array size. But I
> don’t totally follow how that is interacting with the gamm call.
>
> I’m now trying out cubic regression splines (bs=‘cs’ instead of ‘tp’) with
> gamm and also with gamm4. Running the code on subsets of the data (1K
> individuals) suggest only a mild improvement by using ‘cs’ for both
> packages, and a *decrease* in speed using gamm4 instead of gamm. The latter
> surprises me since I had thought that gamm4 was meant to be faster when the
> # of random effects was large.
>
> Eventually I’d like to use smoother-by-group interactions other than the
> CD4_cat_init (i.e. sex, age etc) and test whether trajectories are
> significantly different between covariate classes using AIC. It would also
> be nice to somehow characterize how variable individuals’ trends are within
> a covariate class, though I’m not exactly sure what’s the best way to do
> that.
>
> But until I can get just one of these models to fit, these goals seem like
> a long shot. I’ve struggled to find much documentation online regarding
> fitting GAMMs to such large data sets, particularly one with so many random
> effects. Hence the trial and error exploration of different splines &
> packages. Does anyone have more concrete guidance on how to approach this
> problem or helpful documentation? Help much appreciated!
>
> Thanks,
>
> Steve
>
> Steve Bellan, PhD, MPH
> Post-doctoral Researcher
> Lauren Ancel Meyers Research Group
> Center for Computational Biology and Bioinformatics
> University of Texas at Austin
> http://www.bio.utexas.edu/research/meyers/steve_bellan/ <
> http://www.bio.utexas.edu/research/meyers/steve_bellan/>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-mixed-models op r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>

	[[alternative HTML version deleted]]