[R-sig-ME] GAMM big data (70K rand effects) guidance

Ken Beath ken.beath at mq.edu.au
Sat May 9 12:43:35 CEST 2015


Long vectors, i.e. vectors with 2^31 or more elements, have been
progressively supported in R, but not everywhere. In this case I think it is
because an array can't yet have a dimension of this size, although a vector
can.

> x <- 1
> x[2^31] <- 1
> y <- 1
> y[2^31] <- 1
> z <- cbind(x,y)
Error in cbind(x, y) :
  long vectors not supported yet:
../../../../R-3.1.3/src/include/Rinlinedfuns.h:137

Even if this is fixed, the C and Fortran code in many packages would need to
be modified as well.
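
As a rough way to see in advance whether a planned matrix runs into these
limits, the sizes can be checked with plain arithmetic, without allocating
anything. The sketch below is only illustrative: check_matrix_size is a
hypothetical helper (not part of base R or mgcv), and the 4295 columns are an
assumed figure chosen to show where a 500K-row matrix crosses 2^31 elements,
not the actual width of any matrix built inside gamm.

```r
# Hypothetical helper (not in base R or mgcv): report whether a planned
# nrow x ncol matrix of doubles stays within R's integer-indexed limits.
check_matrix_size <- function(nrow, ncol) {
  int_max <- .Machine$integer.max                 # 2^31 - 1
  n_elem  <- as.numeric(nrow) * as.numeric(ncol)  # doubles avoid integer overflow
  list(
    dims_ok     = nrow <= int_max && ncol <= int_max,  # each dimension must fit in an integer
    long_vector = n_elem > int_max,                    # result would need long vector support
    size_gib    = n_elem * 8 / 2^30                    # 8 bytes per double
  )
}

# The cbind(x, y) example above: two columns of length 2^31
a <- check_matrix_size(2^31, 2)

# 500K observations (as in the original post) with an assumed 4295 columns
b <- check_matrix_size(5e5, 4295)

str(a)
str(b)
```

In the first case the row dimension itself is too large. In the second each
dimension is fine, but the total element count exceeds 2^31 - 1, so every
function that touches the matrix has to support long vectors.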


On 8 May 2015 at 01:31, Steve Bellan <steve.bellan at gmail.com> wrote:

> Hi all,
>
> I am working with a patient database of 70K HIV-infected individuals
> followed over time since treatment initiation, with 500K total observations
> that include a laboratory measurement (CD4 cell count, an indicator of
> immunocompetence). I’m trying to use GAMM to model the CD4 trajectory as a
> function of CD4 at treatment initiation (i.e. y-intercept) and other
> covariate classes (sex, age, etc). Thus far, I’ve struggled to fit GAMMs to
> the entire data set.
>
> I’m fitting a Gaussian model to log(CD4+1) for now. With gamm, this
> gives the following:
>
> > form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init,
> bs="tp")')
> > print(system.time(tg1 <- gamm(form, data = nd, order.groups=F,
> family=gaussian, random=list(PatientID=~1))))
>
> where ayfu is time since treatment initiation and CD4_cat_init is the CD4
> count at treatment initiation broken into 5 categories.
>
> I ran that on a large memory (1TB) node on our HPC cluster and, after 12
> hours using between 300-500 GB of memory, it crashed:
>
> > Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups =
> F,  :
> >   error in evaluating the argument 'x' in selecting a method for
> function 'print': Error in cbind(X1, X[[i]][, j] * X0) :
> >   long vectors not supported yet: bind.c:1301
> > Calls: system.time ... extract.lme.cov2 -> cbind ->
> tensor.prod.model.matrix -> cbind
>
> Google tells me that this has to do with limits on R’s array size. But I
> don’t totally follow how that is interacting with the gamm call.
>
> I’m now trying out cubic regression splines (bs=‘cs’ instead of ‘tp’) with
> gamm and also with gamm4. Running the code on subsets of the data (1K
> individuals) suggests only a mild improvement from using ‘cs’ for both
> packages, and a *decrease* in speed using gamm4 instead of gamm. The latter
> surprises me since I had thought that gamm4 was meant to be faster when the
> number of random effects was large.
>
> Eventually I’d like to use smoother-by-group interactions other than
> CD4_cat_init (e.g. sex, age, etc.) and test whether trajectories are
> significantly different between covariate classes using AIC. It would also
> be nice to somehow characterize how variable individuals’ trends are within
> a covariate class, though I’m not exactly sure what the best way to do
> that is.
>
> But until I can get just one of these models to fit, these goals seem like
> a long shot. I’ve struggled to find much documentation online regarding
> fitting GAMMs to such large data sets, particularly one with so many random
> effects. Hence the trial and error exploration of different splines &
> packages. Does anyone have more concrete guidance on how to approach this
> problem or helpful documentation? Help much appreciated!
>
> Thanks,
>
> Steve
>
> Steve Bellan, PhD, MPH
> Post-doctoral Researcher
> Lauren Ancel Meyers Research Group
> Center for Computational Biology and Bioinformatics
> University of Texas at Austin
> http://www.bio.utexas.edu/research/meyers/steve_bellan/



-- 

*Ken Beath*
Lecturer
Statistics Department
MACQUARIE UNIVERSITY NSW 2109, Australia

Phone: +61 (0)2 9850 8516

Building E4A, room 526
http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/
