[R-sig-ME] Large data set with several nesting levels and autocorrelated errors

ONKELINX, Thierry Thierry.ONKELINX at inbo.be
Tue Jul 24 22:09:59 CEST 2012


Dear Lorenz,

Your model requires a distance matrix of about 2.69e11 elements (518,400 * 518,400). The random-effects grouping fixes a lot of them at zero, but you still need 24 * 3 * 12 * 8 = 6912 blocks of 75 * 75 = 5625 distances each, totalling about 39M non-zero distances. So you need at least 3 * 39M ≈ 117M numbers for the sparse correlation matrix (three numbers per non-zero entry), versus 269G elements for a dense one. Add to that 518k responses and 518k residuals. The design matrix for the fixed effects has about 2 * 3 * 8 * 7 * 518k = 174M elements. The design matrix for the random effects is sparse, with about 8k non-zero elements. And the model holds the original data as well: 518k rows * 6 variables.
Assuming 8 bytes per element (64-bit doubles), I get a minimum footprint of about 2.2 GB, and that is just to store a single copy of the model. You will need a multiple of that to actually fit it.
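In R, the back-of-the-envelope arithmetic looks roughly like this (element counts only, taken from the numbers above; the small sparse random-effects design is ignored and the object names are just for the calculation):

## element counts from the reasoning above, at 8 bytes per double
n_groups   <- 24 * 3 * 12 * 8            # 6912 innermost blocks of 75 observations
cor_sparse <- 3 * n_groups * 75^2        # ~117M numbers for the sparse correlation matrix
fixed_X    <- 2 * 3 * 8 * 7 * 518400     # ~174M elements in the fixed-effects design matrix
data_etc   <- (6 + 2) * 518400           # original variables plus response and residuals
8 * (cor_sparse + fixed_X + data_etc) / 2^30   # about 2.2 GB for a single copy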

A few suggestions:
1) Update R and the packages; there may have been improvements in memory management.
2) Keep your session as clean as possible: keep only the objects you can't do without.
3) Simplify your model. You can simplify the random effects because you use many of these factors in the fixed part as well (otherwise the model is unidentifiable):
lme(hemo ~ expCond * housingCond * ns(time, df = 7) * locHead,
    random      = ~ 1 | subj/block,
    correlation = corARMA(form = ~ time | subj/block/expCond:locHead, p = 3))
4) Using a simpler correlation structure might help as well; to quote Zuur et al. (2009), you don't need the best correlation structure, a reasonable one will do (see the sketch after this list).
5) Reduce the amount of data by aggregation.
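As a sketch of point 4 (untested, reusing the simplified grouping from point 3; 'yourData' is a placeholder, and 'time' is assumed to be an integer index within each 75-second presentation):

library(nlme)
library(splines)

## same simplified model as in point 3, but with a single-parameter AR(1)
## structure instead of ARMA(p = 3)
fit <- lme(hemo ~ expCond * housingCond * ns(time, df = 7) * locHead,
           random      = ~ 1 | subj/block,
           correlation = corAR1(form = ~ time | subj/block/expCond:locHead),
           data = yourData)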

Best regards,

Thierry


________________________________________
From: r-sig-mixed-models-bounces op r-project.org [r-sig-mixed-models-bounces op r-project.org] on behalf of lorenz.gygax op art.admin.ch [lorenz.gygax op art.admin.ch]
Sent: Tuesday, 24 July 2012 16:06
To: r-sig-mixed-models op r-project.org
Subject: [R-sig-ME] Large data set with several nesting levels and autocorrelated errors

Dear Mixed-Modelers,

we are measuring brain activation in animals by non-invasively tracking brain hemodynamics using near-infrared light. We have measured 24 subjects (half of them kept in each of two housing conditions) in three experimental conditions each and presented the corresponding stimuli 12 times per condition (12 blocks). Each presentation lasts 75 seconds, and we get a measurement at 1 Hz for eight measurement channels representing different locations on the head.

This results in 24 * 3 * 12 * 75 * 8 = 518'400 rows in a complete data set (a few stimulations are missing due to measurement artifacts, so the actual data set has slightly fewer than 500'000 rows). So far, we have usually calculated block averages across the 12 repetitions per stimulus (using the median at each time point across the 12 repetitions). In the current experiment, we have the impression that this averaging smooths away too much of the pattern that is visible in the raw data, so we would like to work with the complete, non-averaged data set. We are specifically interested in the shape of the time course of the signal.
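(For concreteness, a sketch of that block-median aggregation; 'dat' and the column names are placeholders matching the model below:)

## median across the 12 blocks at each time point, per subject, condition and channel
blockMedians <- aggregate(hemo ~ subj + expCond + locHead + time,
                          data = dat, FUN = median)
## leaves 24 * 3 * 8 * 75 = 43'200 rows instead of ~518'000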

A full model as we have been using would look something like this:

lme(hemo ~ expCond * housingCond * ns(time, df = 7) * locHead,
    random      = ~ 1 | subj/expCond/block/locHead,
    correlation = corARMA(form = ~ 1 | subj/expCond/block/locHead, p = 3))

From what the data show, an AR(1) process would usually suffice, but somehow the ARMA model with three parameters seems to lead to more (numerically) stable estimates.
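(A minimal sketch of how the two structures could be compared on a subset of subjects; 'dat', the subset size and the use of anova() on two REML fits are assumptions, and 'time' is assumed to be an integer index within each presentation:)

library(nlme)
library(splines)

## fit the same model with AR(1) and with ARMA(3, 0) errors on a few subjects
## (taken from both housing conditions), then compare; with identical fixed
## effects and REML this compares only the correlation structures
sub <- droplevels(dat[dat$subj %in% levels(dat$subj)[1:4], ])

m_ar1 <- lme(hemo ~ expCond * housingCond * ns(time, df = 7) * locHead,
             random      = ~ 1 | subj/expCond/block/locHead,
             correlation = corAR1(form = ~ time | subj/expCond/block/locHead),
             data = sub, method = "REML")

m_arma <- update(m_ar1,
                 correlation = corARMA(form = ~ time | subj/expCond/block/locHead, p = 3))

anova(m_ar1, m_arma)   # AIC/BIC plus a likelihood-ratio test of the extra ARMA parameters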

This model unfortunately fails on a 64-bit Windows 7 Enterprise machine with 24 GB of RAM due to memory limits (sessionInfo is attached at the end). I have only now noticed that this is not really the most recent version of R (this is a machine at our office and could and should be updated, of course). Would that make a (the?) difference with respect to memory management?

I presume that some other functions, like lmer, would be more economical with memory, but from looking through the archives and the websites on R and mixed models, there does not seem to be an alternative function that is more economical and can also incorporate temporal correlation in the residuals. Is this correct?

Do any of you happen to know whether it is possible to make virtual memory on a Windows machine visible to R? (We have set up some virtual memory on that machine, but R only recognizes the 24 GB of actual RAM.)
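(For concreteness, the Windows-only function involved, as far as we understand it; the 48 GB figure is just an illustrative value:)

## Windows builds of R only: inspect and raise R's memory cap (in MB);
## raising it above physical RAM lets R spill into the Windows page file,
## although fitting a model that way will be very slow
memory.limit()               # current cap, reported in MB
memory.limit(size = 49152)   # e.g. raise to 48 GB (illustrative value)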

Is it possible to estimate how much memory would be needed?

Any further thoughts, or thoughts on alternative approaches?

Many thanks and best wishes, Lorenz
-
Lorenz Gygax
Centre for proper housing of ruminants and pigs
Tänikon, CH-8356 Ettenhausen / Switzerland


R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=German_Switzerland.1252

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] nlme_3.1-98

loaded via a namespace (and not attached):
[1] grid_2.12.2     lattice_0.19-17 tools_2.12.2

_______________________________________________
R-sig-mixed-models op r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models