[R-sig-ME] GAMM big data (70K rand effects) guidance

Thu May 7 17:31:52 CEST 2015

Hi all,

I am working with an patient data base of 70K HIV-infected individuals followed over time since treatment initiation, with 500K total observations that include a laboratory measurement (CD4 cell count—an indicator of immunocompetence). I’m trying to use GAMM to model the CD4 trajectory as a function of CD4 at treatment initiation (i.e. y-intercept) and other covariate classes (sex, age, etc). Thus, far I’ve struggled to fit GAMMs to the entire data set.

I’m using a gaussian link function to log(CD4+1) for now. With gamm, this gives the following:

> form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init, bs=“tp")')
> print(system.time(tg1 <- gamm(form, data = nd, order.groups=F, family=gaussian, random=list(PatientID=~1))))

where ayfu is time since treatment initiation and CD4_cat_init is the CD4 count at treatment initiation broken into 5 categories.

I ran that on a large memory (1TB) node on our HPC cluster and, after 12 hours using between 300-500 GB of memory, it crashed:

> Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups = F,  : 
>   error in evaluating the argument 'x' in selecting a method for function 'print': Error in cbind(X1, X[[i]][, j] * X0) : 
>   long vectors not supported yet: bind.c:1301
> Calls: system.time ... extract.lme.cov2 -> cbind -> tensor.prod.model.matrix -> cbind

Google tells me that this has to do with limits on R’s array size. But I don’t totally follow how that is interacting with the gamm call.

I’m now trying out cubic regression splines (bs=‘cs’ instead of ‘tp’) with gamm and also with gamm4. Running the code on subsets of the data (1K individuals) suggest only a mild improvement by using ‘cs’ for both packages, and a *decrease* in speed using gamm4 instead of gamm. The latter surprises me since I had thought that gamm4 was meant to be faster when the # of random effects was large.

Eventually I’d like to use smoother-by-group interactions other than the CD4_cat_init (i.e. sex, age etc) and test whether trajectories are significantly different between covariate classes using AIC. It would also be nice to somehow characterize how variable individuals’ trends are within a covariate class, though I’m not exactly sure what’s the best way to do that.

But until I can get just one of these models to fit, these goals seem like a long shot. I’ve struggled to find much documentation online regarding fitting GAMMs to such large data sets, particularly one with so many random effects. Hence the trial and error exploration of different splines & packages. Does anyone have more concrete guidance on how to approach this problem or helpful documentation? Help much appreciated!

Thanks,

Steve

Steve Bellan, PhD, MPH
Post-doctoral Researcher
Lauren Ancel Meyers Research Group
Center for Computational Biology and Bioinformatics
University of Texas at Austin
http://www.bio.utexas.edu/research/meyers/steve_bellan/ <http://www.bio.utexas.edu/research/meyers/steve_bellan/>

	[[alternative HTML version deleted]]