[R-sig-ME] Large data set with several nesting levels and autocorrelated errors

lorenz.gygax at art.admin.ch
Thu Jul 26 08:51:10 CEST 2012


Dear Thierry,

many thanks for your time and input. 

> Your model requires a distance matrix of 2.69e11 elements (518000 * 518000). The
> random effects fix a lot of them at zero, but you still need 24 * 3 * 12 * 8 = 6912
> blocks of 75 * 75 = 5625 distances each, still totalling about 39M non-zero distances.
> So you need at least 3 * 39M ≈ 117M elements for the sparse correlation matrix
> (269G elements for a dense matrix). Furthermore, 518k responses and 518k residuals. The
> design matrix for the fixed effects is about 2 * 3 * 8 * 7 * 518k = 174M elements.
> The design matrix for the random effects is a sparse matrix with 8k non-zero
> elements. The model holds the original data as well (518k rows * 6 variables).
> Assuming 8 bytes per element (64-bit system) I get a minimum footprint of 2.2 GB.
> And that is just to store one copy of the model. So you'll need a multiple of that to
> calculate the model.

Ok, this would nevertheless imply that, with 24 GB of RAM, fitting such a model is not completely off limits.
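
Just to make sure I follow the arithmetic, here is a minimal R sketch of that footprint estimate (the variable names are mine, and I assume 8 bytes per double, as you did):

n.obs   <- 24 * 3 * 12 * 75 * 8            # 518'400 responses
n.corr  <- 3 * (24 * 3 * 12 * 8) * 75^2    # sparse correlation matrix, ~117M triplets
n.fixed <- (2 * 3 * 8 * 7) * n.obs         # fixed-effects design matrix, ~174M elements
n.data  <- 6 * n.obs                       # the original data frame
bytes   <- 8 * (n.corr + n.fixed + 2 * n.obs + n.data)  # plus responses and residuals
bytes / 2^30                               # ~2.2 GB for a single copy of the model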

> A few suggestions:
> 1) update R and the packages. There might be some improvement in memory
> management.

We will certainly do that.

> 2) keep your session as clean as possible. Use only the objects you can't do without

We will also optimise this, though I hope that I have kept a proper eye on this issue already.

> 3) simplify your model. You can simplify the random effect because you use many
> factors in the fixed part as well (leading to an unidentifiable model)
> lme (hemo ~ expCond * housingCond * ns (time, df= 7) * locHead,
>         random= ~ 1 | subj/block,
>         correlation = corARMA (form= ~ time | subj/block/expCond:locHead, p= 3))

I do not agree with that. In my view, I am only using the same variables in the random effects as in the fixed effects as a shortcut; the internally constructed variables are actually different. If we go back to my suggestion:

lme (hemo ~ expCond * housingCond * ns (time, df= 7) * locHead,
        random= ~ 1 | subj/expCond/block/locHead,
        correlation= corARMA (form= ~ 1 | subj/expCond/block/locHead, p= 3))

e.g., expCond is a factor with three levels (the experimental conditions) in the fixed-effects part, but subj/expCond is a grouping factor with 72 levels (the combinations of the 24 subjects with the three experimental conditions). Instead of expCond, I could have used the calendar date here, because the different experimental conditions were applied on different days.
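
As a small toy sketch of this point (the factor layout below is made up purely for illustration), the nesting operator builds a new grouping factor rather than reusing the fixed-effects factor:

subj    <- gl (24, 3)              # 24 subjects, each measured in 3 conditions
expCond <- gl (3, 1, length= 72)   # the 3 experimental conditions
nlevels (expCond)                  # 3 levels in the fixed-effects part
nlevels (interaction (subj, expCond, drop= TRUE))
                                   # 72 levels: the innermost grouping of subj/expCond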

In my view, I have a properly nested hierarchical model: the time course is reflected by the single observations available within each of the eight locations on the head. Measurements at these eight locations were made at the same time (i.e. in the same block); blocks were dependent repeats because they were applied in direct temporal succession on a given day; each day belonged to a specific experimental condition; and all subjects were subjected to all experimental conditions.

I also think that I do indeed need these levels because the fixed effects sit on different hierarchical levels: housing condition is a between-subject variable, experimental condition is a within-subject (and between-day) variable, location on the head is a within-block variable, and the time course is at the level of the single observations.

Last but not least, I have never had trouble estimating models with a similar structure in the past, so they do not seem to be (numerically) unidentifiable, and I do not see why they should be, given that there are repeated measurements with respect to all fixed and random effects.

> 4) using a simpler correlation structure might help as well. To quote Zuur et al (2009)
> you don't need the best correlation structure. A reasonable one will do.

Ok. We should look into this.
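
If we do, a simpler structure might look like the sketch below (untested, and 'dat' is just a placeholder name for our complete data frame):

library (nlme)
library (splines)
fit.ar1 <- lme (hemo ~ expCond * housingCond * ns (time, df= 7) * locHead,
        random= ~ 1 | subj/expCond/block/locHead,
        correlation= corAR1 (form= ~ 1 | subj/expCond/block/locHead),
        data= dat)   # single AR(1) parameter instead of corARMA with p= 3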

> 5) Reduce the amount of data by aggregation.

That is exactly what we have tried so far by block-averaging but, of course, we might need to start thinking about other ways of aggregating.
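
For reference, the block-averaging we have used so far amounts to something like this sketch (column names as in the model above; 'dat' again stands for the complete data frame):

dat.med <- aggregate (hemo ~ subj + expCond + locHead + time,
        data= dat, FUN= median)   # median across the 12 blocks at each time point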

Again, many thanks for your thoughts and input, which are highly valuable! Lorenz

> ________________________________________
> From: r-sig-mixed-models-bounces at r-project.org on behalf of
> lorenz.gygax at art.admin.ch
> Sent: Tuesday, 24 July 2012 16:06
> To: r-sig-mixed-models at r-project.org
> Subject: [R-sig-ME] Large data set with several nesting levels and
> autocorrelated errors
> 
> Dear Mixed-Modelers,
> 
> we are measuring brain activation in animals by non-invasively tracking brain
> hemodynamics using near-infrared light. We have measured 24 subjects (half of
> which are kept in either of two housing conditions) in three experimental conditions
> each and present the according stimuli 12 times per condition (12 blocks). Each
> presentation lasts 75 seconds and we get a measurement at 1 Hz for eight
> measurement channels representing different locations on the head.
> 
> This results in 24 * 3 * 12 * 75 * 8 = 518'400 rows in a complete data set (a few
> stimulations are actually missing due to measurement artifacts, resulting in a bit
> less than 500'000 rows in the actual data set). So far, we have usually calculated
> block averages across the 12 repetitions per stimulus (using the median at each time
> point across the 12 repetitions). In the current experiment, we have the impression
> that this averaging smooths away too much of the pattern that is visible in the raw
> data, and thus we would like to work with the complete, non-averaged data set. We
> are specifically interested in the shape of the time course of the signal.
> 
> A full model as we have been using would look something like this:
> 
> lme (hemo ~ expCond * housingCond * ns (time, df= 7) * locHead,
>         random= ~ 1 | subj/expCond/block/locHead,
>         correlation= corARMA (form= ~ 1 | subj/expCond/block/locHead, p= 3))
> 
> From what the data show, an AR(1) process would usually suffice, but somehow
> the ARMA model with three parameters seems to lead to more (numerically) stable
> estimates.
> 
> This model unfortunately fails on a 64-bit Windows 7 Enterprise machine with 24 GB
> of RAM due to memory limits (sessionInfo is attached at the end). I have only now
> noticed that this is not really the most recent version of R (this is a machine at our
> office and could and should be updated, of course). Would that make a (the?)
> difference with respect to memory management?
> 
> I presume that some other functions like lmer would be more economical in
> memory use, but looking through the archives and the web sites on R
> and mixed models, there does not seem to be an alternative function that is more
> economical and can incorporate temporal correlation in the residuals. Is this
> correct?
> 
> Do any of you happen to know whether it is possible to make virtual memory
> on a Windows machine visible to R? (We have set up some virtual memory on that
> machine, but R only recognizes the 24 GB of actual RAM.)
> 
> Is it possible to estimate how much memory would be needed?
> 
> Any further thoughts, or suggestions for alternative approaches?
> 
> Many thanks and best wishes, Lorenz
> -
> Lorenz Gygax
> Centre for proper housing of ruminants and pigs
> Tänikon, CH-8356 Ettenhausen / Switzerland
> 
> 
> R version 2.12.2 (2011-02-25)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
> 
> locale:
> [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252
> [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Switzerland.1252
> 
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] nlme_3.1-98
> 
> loaded via a namespace (and not attached):
> [1] grid_2.12.2     lattice_0.19-17 tools_2.12.2
> 