[R-sig-ME] Question regarding large data.frame in LMER?

Harold Doran harold.doran at cambiumassessment.com
Tue Dec 15 18:07:12 CET 2020


This is an outstanding reply and a hugely helpful post. Phillip, you have done good work. It will take me a bit of time to absorb all of this, and I may have some future thoughts/questions.

My software for mixed models dances between R and C++, which, as Doug notes, is a bit of work. I suspect my representation of the Henderson mixed-model solution differs drastically from lmer's (or from the Julia variants; I'm not sure what to call those), but it stands on the shoulders of Davis' work and the Bates/Maechler implementations in the Matrix package. In profiling my applications, the simple symbolic/numeric Cholesky decompositions were (for me) huge improvements.
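
A minimal sketch (toy matrix, not my production code) of that symbolic/numeric split using the Matrix package -- the sparsity pattern is analyzed once, and only the numeric factor is recomputed when the values change:

library(Matrix)

## toy sparse cross-product, standing in for a matrix whose values
## (but not sparsity pattern) change at every iteration
A <- crossprod(rsparsematrix(1000, 200, density = 0.02))

## symbolic analysis + numeric factorization of A + 1*I, done once
L <- Cholesky(A, LDL = FALSE, Imult = 1)

## at a "new parameter value" only the numbers change, so we reuse the
## symbolic analysis and just refresh the numeric factor
L <- update(L, A, mult = 2)     # now factors A + 2*I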

I have two "tricks" I use in my solution that have improved speed. One is using the woodbury identity lemma. There is one place where I found I *had* to take an inverse and I was able to reduce the size of that matrix using that lemma.

I also found some smaller places where I could parallelize some data management tasks within the code, but not inside the real workhorse--I found that a harder problem and gave up on that front. 

These are hard problems.


-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Phillip Alday
Sent: Tuesday, December 15, 2020 11:12 AM
To: Harold Doran <harold.doran using cambiumassessment.com>; Jad Moawad <jad.moawad using unil.ch>; Douglas Bates <bates using stat.wisc.edu>; Voeten, C.C. <c.c.voeten using hum.leidenuniv.nl>
Cc: r-sig-mixed-models using r-project.org
Subject: Re: [R-sig-ME] Question regarding large data.frame in LMER?

External email alert: Be wary of links & attachments.


Hi everybody,

since I'm the one who wrote the limited parallelized code in MixedModels.jl, I'll comment a little bit on that as well as a few general points.

1. An efficient profiled log-likelihood is important to both lme4 and MixedModels.jl. Critically, both manage to rewrite the problem as a penalized least squares problem instead of the more typical generalized least squares formulation used by e.g. nlme. This has advantages both in computation and in representation (next point). For that, see the lme4 paper, the NextJournal article Doug linked, or ... when I can actually block out some time from the day job and finish my small part of it, the forthcoming MixedModels.jl paper. (Sorry, Doug! I haven't forgotten about it...)
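
For concreteness (this is the formulation in the lme4 paper, stated very loosely): for a fixed value of the covariance parameters theta, the conditional mode of the random effects and the fixed-effects estimates come from a single penalized least squares problem,

   \min_{\beta, u} \; \| y - X\beta - Z\Lambda_\theta u \|^2 + \| u \|^2 ,

and the profiled log-likelihood that the optimizer actually sees is then a function of theta alone.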

2. Sparse matrices are important to both lme4 and MixedModels.jl, and not just because of the direct reduction in memory footprint. It turns out that memory allocation itself is really slow, even when you have enough memory, so if you can do sparse operations in place (rather than allocating and copying), you can make things really fast. Joel on Software discusses this issue in one of his posts (https://www.joelonsoftware.com/2001/12/11/back-to-basics/). lme4 does this to some extent, but doing it really well requires bouncing back and forth between R and C++, which introduces its own complexities.

So that brings us to where Julia shines compared to R: you do everything in Julia, and Julia's memory model is such that it's easier to do things in place. So in Julia we can prototype very specialized methods for particular types of sparsity and do all of those sparse operations in place, without bringing in another language. In other words, it's easier to try new stuff out, which is how some of the innovations in MixedModels.jl arose compared to lme4 -- Doug prototyped a particular specialization quickly because it was easy to do so, and it proved worthwhile. (There were also a few happy accidents in deriving a somewhat more efficient profiled log-likelihood, both in computation and in storage, but back-porting those changes to lme4 is somewhat more difficult due to the two-language problem.)

@Doug: feel free to correct any infelicities or oversimplifications that I probably introduced.

On the parallelism front: there are many different levels at which to be parallel. The easiest and most obvious one is at the level of the basic matrix operations, i.e. those provided by the BLAS (Basic Linear Algebra Subprograms). Microsoft/Revolution's R release actually packages the Intel MKL, a very specialized and fast BLAS, instead of the default BLAS that R uses on most platforms. (On macOS, the system BLAS is actually very good, at least on Intel chips.) There are decades of research into making BLAS fast, including its parallelization. I frequently see all cores being saturated when doing BLAS calls from Julia because the OpenBLAS variant Julia links to is multi-threaded by default. On Linux, you can change the BLAS that R uses to something more tuned to your system, if you want; here's my somewhat outdated blog post on it: https://phillipalday.com/blog/2013/05/31/Speed-up-R-(on-Linux)/
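
If you want to see which BLAS R is linked against and experiment with the thread count from within a session, a small sketch (RhpcBLASctl is a CRAN package, not something lme4 itself uses; installing a different BLAS is still an OS-level job):

sessionInfo()                 # reports the BLAS/LAPACK libraries in use

library(RhpcBLASctl)
blas_get_num_procs()          # how many threads the BLAS may use
blas_set_num_threads(4)       # cap it, e.g. before adding your own parallelism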

In Julia, I've tried to parallelize some loops in the evaluation of the log-likelihood, but ... it generally didn't help much. This suggests that those loops weren't the bottleneck and/or that parallelization introduces a different bottleneck (more on that in a minute). lme4 currently has an open issue on using an optimizer that can itself be parallelized (i.e. the thing that takes the log-likelihood function and uses it to find the parameter values corresponding to the maximum
likelihood): https://github.com/lme4/lme4/issues/508 but progress on that has kinda stalled. It doesn't seem to offer a huge boost; parallel programming is hard; etc. Both lme4 and MixedModels.jl are written in such a way that it is theoretically possible to use a parallelized optimizer, but it just doesn't seem to be worth it.

One thing that should be worth parallelizing, though, is the parametric bootstrap, because each bootstrap replicate is, by definition, independent of the others (it's an "embarrassingly parallel" problem). I wrote this for MixedModels.jl -- if you want it to be really nice, there is some subtlety in dealing with the random number generator and in avoiding superfluous memory allocations (see above), but it's relatively straightforward as far as parallel programming goes. When I first implemented it, we noticed a huge speed boost. Then Doug made some algorithmic improvements for the single-threaded case, and a new Julia version came out that linked to a new BLAS version, and the boost went away, sometimes even running slower. The problem? The new BLAS version was heavily multithreaded and was already taking full advantage of all the cores on my system for a single model fit, so when I added another layer of parallelism on top of that, I suddenly had more parallel operations than cores, and swapping between operations (context switching) is not free. So even then it didn't help. I tried increasing the granularity of the parallelism ... and that helped a bit, but it added a huge amount of code complexity, so I ultimately discarded that attempt. (I think lme4 can still take better advantage of this in its bootstrap because the default R BLAS isn't quite as good at multithreading as the current Julia BLAS.) If you're curious and bored, I talked about this as part of JuliaCon this year:
https://www.youtube.com/watch?v=qhfQKIDR7Ms
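
On the lme4 side you can experiment with exactly this trade-off yourself, since the parametric bootstrap is exposed with optional parallelism. A minimal sketch (toy model from the lme4 examples; core count arbitrary; "multicore" forks and so won't work on Windows):

library(lme4)
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

## 500 parametric bootstrap replicates of the fixed effects on 4 workers;
## whether this beats the serial run depends on how multithreaded your
## BLAS already is (see above)
b <- bootMer(fm, FUN = fixef, nsim = 500,
             parallel = "multicore", ncpus = 4)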

Of course, this is all parallelism at the level of multiple local CPUs/cores -- with things like clusters with multiple nodes (which require some coordination overhead) or GPUs (which don't always play nice with sparse matrices and introduce their own coordination overhead), you can introduce some more parallelism, so there may still be some gains to be made in this area. Also, if you have a system with just an insane number of cores (> 64), then you probably have more cores than your BLAS can take advantage of, and then layering locally threaded bootstrapping on top of that may still help. (Anybody with a Threadripper CPU want to run some benchmarks for me?)

There's also Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl's_law), which puts a hard limit on how much of a boost parallelism can give you.
(Though for things like bootstrapping, if you had zero coordination cost, then you do get linear scaling until the number of cluster nodes matches the number of bootstrap replicates...)
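
As a back-of-the-envelope illustration of how punishing that limit is (not a benchmark):

## Amdahl's law: speedup with s workers when a fraction p is parallelizable
amdahl <- function(p, s) 1 / ((1 - p) + p / s)

amdahl(p = 0.90, s = 8)      # ~4.7x, not 8x
amdahl(p = 0.90, s = 1e6)    # caps out near 10x, no matter how many cores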

In other words, parallelism is hard and not quite as helpful as you'd hope. (And I haven't even talked about SIMD-style parallelism, which is a bit closer to GPU-style parallelism but on a smaller scale and part of the floating-point unit of most modern CPUs.)

This doesn't mean there are no more performance enhancements to be made.
Quite the contrary -- if we can continue improving and specializing our storage and computation methods, then we can often reduce the total number of (multithreaded) BLAS operations necessary, which is far more effective than just making those operations each individually faster.
And these advancements are exactly (1) and (2) above and in Harold's message.

As a contrast, Stan has seen some recent improvements in speed by using more types of parallelism within chains, but these are related to (A) the types of BLAS-level parallelism I mentioned above and (B) taking advantage of some nice associativity tricks in summing up the individual terms of the log-likelihood. These types of algorithmic advances are largely already present in MixedModels.jl, just in very specialized form. (I think the Stan team is doing some really cool work, but the pro and con of Stan and of general frameworks like TMB is that they are general, so they can't take advantage of specializations very particular to a given model class, like mixed models.)

Best,
Phillip

PS: I liked the question! This is not meant to be dismissive, but instead to highlight how hard it all is. :)

On 15/12/20 4:14 pm, Harold Doran wrote:
> @doug, it seems like the "speed" conversation for mixed models has two considerations: 1) how can we rewrite the log-likelihood to improve optimization and 2) what computational things can be done to improve the speed.
>
> As an example, your note below is more along the lines of (1) whereas sparse matrices are more along the lines of (2).
>
> One computational topic that I know is harder with LMMs is parallel processing. Almost all software I have written in the past 5 years uses parallel processing -- but my tasks are based on the concept of being "embarrassingly parallel", which makes the split-apply-combine problem quite easy.
>
> Parallel processing with larger matrix operations is more complex. However, I have fit lmer models using Microsoft's R distribution, timed it against standard R, and found it was quite a bit faster. Same data, same computer, same model, same output (and same results). I *think* it's more than just a different BLAS being used by MS, but perhaps something fundamentally different in its workhorse for some matrix ops. Others correct me if I'm wrong, but I'm curious about your thoughts, Doug, on whether this is an avenue mixed model thinkers/developers should consider (being agnostic for a moment about which software is used).
>
>
>
> From: Jad Moawad <jad.moawad using unil.ch>
> Sent: Monday, December 14, 2020 11:38 AM
> To: Douglas Bates <bates using stat.wisc.edu>; Voeten, C.C. 
> <c.c.voeten using hum.leidenuniv.nl>
> Cc: Harold Doran <harold.doran using cambiumassessment.com>; 
> r-sig-mixed-models using r-project.org
> Subject: RE: [R-sig-ME] Question regarding large data.frame in LMER?
>
>
> Sorry, I did not know that I had to keep poly(Age, 2) as the default. Thank you for clarifying, I will try it again.
>
>
>
> The number of levels in the id variable is 567058. People are supposedly followed for 4 years, but of course there is some attrition along the way.
>
>
>
> Yes please, I would like to try the Julia package. After all, I am now using the second-best option in terms of design. However, if I can go back to the original, it would be better for answering my research question.
>
> ________________________________
> From: Douglas Bates <bates using stat.wisc.edu<mailto:bates using stat.wisc.edu>>
> Sent: Monday, 14 December 2020 17:09:15
> To: Voeten, C.C.
> Cc: Harold Doran; Jad Moawad; r-sig-mixed-models using r-project.org<mailto:r-sig-mixed-models using r-project.org>
> Subject: Re: [R-sig-ME] Question regarding large data.frame in LMER?
>
> Sorry for coming to this discussion late.  I notice that in the glimpse of the data set the first few values of id occur in pairs.  If you convert id to a factor how many levels does it have?
>
> The implementation of mixed-effects models in the MixedModels (https://github.com/JuliaStats/MixedModels.jl) package for Julia is better suited to dealing with large numbers of observations than is the one in lme4.  It uses a more compact numerical representation of the model during the iterations to optimize the profiled log-likelihood.  Some of this is described in https://nextjournal.com/dmbates/complexity-in-fitting-linear-mixed-models  The important point is that evaluation of the profiled log-likelihood for new parameter values does not require working with matrices whose size depends on n, the number of observations.  All matrices are of sizes determined by the number of random effects and the number of fixed-effects parameters.  Of course, if each id only occurs twice then the number of random effects is already on the order of the number of observations so this will not be an advantage.
>
> If you are interested in trying the Julia package we can help with that.
>
> If anyone wants to try to implement the MixedModels approach in R I can advise you how to do it but I really don't have the energy to do the R/C++ two-language dance any more.
>
> On Fri, Dec 11, 2020 at 4:48 AM Voeten, C.C. <c.c.voeten using hum.leidenuniv.nl<mailto:c.c.voeten using hum.leidenuniv.nl>> wrote:
> Another option could be to try fitting your model using mgcv::bam, which is optimized for 'big' datasets. The function is primarily intended for GAMMs, but an LMM is just a specific type of GAMM, so this is no problem. You could use buildmer::re2mgcv to convert your lme4 random-effects specification (i.e. your formula using | terms) to the equivalent mgcv specification (using s() terms).
>
> However, note that the bam model will not be completely equivalent to the lme4 model, in that the bam model will not model the correlations between the random effects. If those are important to you, then please disregard my suggestion!
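>
> To make the idea concrete, a hand-written sketch of one of the models below in bam (random intercepts and an uncorrelated random slope written as bs = "re" terms; re2mgcv can generate the full translation for you, and only the age slope is shown here):
>
> library(mgcv)
> ## grouping variables need to be factors for bs = "re"
> m <- bam(health ~ class + age + I(age^2) + class*macro_unemployment +
>            s(country, bs = "re") + s(country, age, bs = "re") +
>            s(country_year, bs = "re") + s(id, bs = "re"),
>          data = df, discrete = TRUE)   # discrete = TRUE speeds up big fits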
>
> Cesko
>
> -----Original Message-----
> From: R-sig-mixed-models 
> <r-sig-mixed-models-bounces using r-project.org<mailto:r-sig-mixed-models-bo
> unces using r-project.org>> On Behalf Of Harold Doran
> Sent: Friday, December 11, 2020 11:38 AM
> To: Jad Moawad <jad.moawad using unil.ch<mailto:jad.moawad using unil.ch>>; 
> r-sig-mixed-models using r-project.org<mailto:r-sig-mixed-models using r-project.o
> rg>
> Subject: Re: [R-sig-ME] Question regarding large data.frame in LMER?
>
> Assuming that you're sampling from your complete data set in a way that represents the complete data, one strategy might also be to use starting values from prior converged models and incrementally increase the size of the data.
>
> For example,
>
> 1) run the model with 10% of the data and get parameter estimates
> 2) use the parameter estimates from (1) as starting values and increase the size of the data to 40%
> 3) repeat
>
> The strategy doesn't help with or solve the p.d. issue, but it does improve the potential for reaching the top of the hill faster with a big data file.
>
> It's an incremental EM idea that reduces the amount of work lmer() (or any iterative maximization procedure) would need to do with a very large file. In other words, why start all over again with a very big file when we can start somewhere better and let the algorithm start closer to the top of the hill, so to speak.
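>
> A minimal sketch of that hand-off in lmer (object and data names hypothetical; the key part is passing the converged theta vector from the smaller fit as the start for the bigger one):
>
> fit10 <- lmer(form, data = df10)                 # fit to the 10% subsample
> fit40 <- lmer(form, data = df40,                 # warm-start the 40% fit
>               start = list(theta = getME(fit10, "theta")))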
>
> Hope it helps.
> Harold
>
>
>
> -----Original Message-----
> From: R-sig-mixed-models 
> <r-sig-mixed-models-bounces using r-project.org<mailto:r-sig-mixed-models-bo
> unces using r-project.org>> On Behalf Of Jad Moawad
> Sent: Thursday, December 10, 2020 7:12 AM
> To: 
> r-sig-mixed-models using r-project.org<mailto:r-sig-mixed-models using r-project.o
> rg>
> Subject: [R-sig-ME] Question regarding large data.frame in LMER?
>
> External email alert: Be wary of links & attachments.
>
>
> I am working with a large data.frame that contains around 1.4 million observations. Initially, when I was running my models, I was working on a sub-sample (10% of my full sample), because running one model on the original data can take a lot of time. Once I was sure that all variables were well harmonized and all regressions were running fine, I ran my models using the full sample. However, the regression did not converge, and I received the following two errors from two different models:
>
> Error in fun(xaa, ...) : Downdated VtV is not positive definite
>
> Error in fun(xss, ...) : Downdated VtV is not positive definite
>
> I use the lmer function to fit my model, and I include random slopes at the country and country_year levels. Below you find the code that I use.
>
> Model1 <- lmer(health~ class + age + I(age^2)  + class*macro_unemployment +
>                (class + age + I(age^2)|country) +
>                (class+ age + I(age^2) |country_year) +
>                (1|id), data=df)
>
> Model2 <- lmer(health~ education + age + I(age^2)  + education*macro_unemployment+
>                (education + age + I(age^2)|country) +
>                (education + age + I(age^2) |country_year) +
>                (1|id), data=df)
>
>
> Could someone help me please with solving this issue?
>
> Below you find a glimpse (str) of my data and my sessionInfo():
>
> tibble [1,370,264 x 8] (S3: grouped_df/tbl_df/tbl/data.frame)
>  $ health            : num [1:1370264] 100 100 50 100 0 75 75 100 100 50 ...
>  $ class             : Factor w/ 3 levels "Upper-middle class",..: 3 3 NA 3 3 3 3 1 1 3 ...
>  $ education         : Factor w/ 3 levels "low","mid","high": 1 1 1 1 1 1 2 3 3 1 ...
>  $ age               : num [1:1370264] 24 25 24 25 42 43 34 34 35 58 ...
>  $ macro_unemployment: num [1:1370264] 5.24 4.86 5.24 4.86 5.24 ...
>  $ id                : int [1:1370264] 2 2 3 3 4 4 6 7 7 8 ...
>  $ country_year      : int [1:1370264] 1 2 1 2 1 2 1 1 2 1 ...
>  $ country           : Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
>  - attr(*, "groups")= tibble [27 x 2] (S3: tbl_df/tbl/data.frame)
>   ..$ country: Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 6 7 8 9 10 11 12 ...
>   ..$ .rows  : list<int> [1:27]
>   .. ..$ : int [1:47204] 1 2 3 4 5 6 7 8 9 10 ...
>   .. ..$ : int [1:41361] 47205 47206 47207 47208 47209 47210 47211 47212 47213 47214 ...
>   .. ..$ : int [1:42407] 88566 88567 88568 88569 88570 88571 88572 88573 88574 88575 ...
>   .. ..$ : int [1:48253] 130973 130974 130975 130976 130977 130978 130979 130980 130981 130982 ...
>   .. ..$ : int [1:31917] 179226 179227 179228 179229 179230 179231 179232 179233 179234 179235 ...
>   .. ..$ : int [1:44047] 211143 211144 211145 211146 211147 211148 211149 211150 211151 211152 ...
>   .. ..$ : int [1:62087] 255190 255191 255192 255193 255194 255195 255196 255197 255198 255199 ...
>   .. ..$ : int [1:94309] 317277 317278 317279 317280 317281 317282 317283 317284 317285 317286 ...
>   .. ..$ : int [1:37246] 411586 411587 411588 411589 411590 411591 411592 411593 411594 411595 ...
>   .. ..$ : int [1:77253] 448832 448833 448834 448835 448836 448837 448838 448839 448840 448841 ...
>   .. ..$ : int [1:16823] 526085 526086 526087 526088 526089 526090 526091 526092 526093 526094 ...
>   .. ..$ : int [1:24687] 542908 542909 542910 542911 542912 542913 542914 542915 542916 542917 ...
>   .. ..$ : int [1:116263] 567595 567596 567597 567598 567599 567600 567601 567602 567603 567604 ...
>   .. ..$ : int [1:43218] 683858 683859 683860 683861 683862 683863 683864 683865 683866 683867 ...
>   .. ..$ : int [1:28709] 727076 727077 727078 727079 727080 727081 727082 727083 727084 727085 ...
>   .. ..$ : int [1:27583] 755785 755786 755787 755788 755789 755790 755791 755792 755793 755794 ...
>   .. ..$ : int [1:77960] 783368 783369 783370 783371 783372 783373 783374 783375 783376 783377 ...
>   .. ..$ : int [1:36922] 861328 861329 861330 861331 861332 861333 861334 861335 861336 861337 ...
>   .. ..$ : int [1:93194] 898250 898251 898252 898253 898254 898255 898256 898257 898258 898259 ...
>   .. ..$ : int [1:9004] 991444 991445 991446 991447 991448 991449 991450 991451 991452 991453 ...
>   .. ..$ : int [1:40074] 1000448 1000449 1000450 1000451 1000452 1000453 1000454 1000455 1000456 1000457 ...
>   .. ..$ : int [1:29342] 1040522 1040523 1040524 1040525 1040526 1040527 1040528 1040529 1040530 1040531 ...
>   .. ..$ : int [1:85124] 1069864 1069865 1069866 1069867 1069868 1069869 1069870 1069871 1069872 1069873 ...
>   .. ..$ : int [1:92350] 1154988 1154989 1154990 1154991 1154992 1154993 1154994 1154995 1154996 1154997 ...
>   .. ..$ : int [1:50188] 1247338 1247339 1247340 1247341 1247342 1247343 1247344 1247345 1247346 1247347 ...
>   .. ..$ : int [1:7598] 1297526 1297527 1297528 1297529 1297530 1297531 1297532 1297533 1297534 1297535 ...
>   .. ..$ : int [1:65141] 1305124 1305125 1305126 1305127 1305128 1305129 1305130 1305131 1305132 1305133 ...
>   .. ..@ ptype: int(0)
>   ..- attr(*, ".drop")= logi TRUE
>>
>
>
> Session Info:
>
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Catalina 10.15.6
>
> Matrix products: default
> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
> LAPACK: 
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.
> dylib
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices
> [4] utils     datasets  methods
> [7] base
>
> other attached packages:
>  [1] sessioninfo_1.1.1
>  [2] sjlabelled_1.1.5
>  [3] varhandle_2.0.5
>  [4] labelled_2.7.0
>  [5] dplyr_1.0.0
>  [6] ggplot2_3.3.2
>  [7] forcats_0.5.0
>  [8] reprex_0.3.0
>  [9] lmerTest_3.1-3
> [10] lme4_1.1-25
> [11] Matrix_1.2-18
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.4.6
>  [2] compiler_4.0.2
>  [3] pillar_1.4.4
>  [4] nloptr_1.2.2.1
>  [5] tools_4.0.2
>  [6] digest_0.6.25
>  [7] boot_1.3-25
>  [8] statmod_1.4.34
>  [9] lifecycle_0.2.0
> [10] tibble_3.0.1
> [11] nlme_3.1-148
> [12] gtable_0.3.0
> [13] lattice_0.20-41
> [14] pkgconfig_2.0.3
> [15] rlang_0.4.7
> [16] cli_2.0.2
> [17] rstudioapi_0.11
> [18] haven_2.3.1
> [19] withr_2.2.0
> [20] hms_0.5.3
> [21] generics_0.0.2
> [22] vctrs_0.3.1
> [23] fs_1.4.1
> [24] grid_4.0.2
> [25] tidyselect_1.1.0
> [26] glue_1.4.1
> [27] R6_2.4.1
> [28] fansi_0.4.1
> [29] minqa_1.2.4
> [30] farver_2.0.3
> [31] purrr_0.3.4
> [32] magrittr_1.5
> [33] scales_1.1.1
> [34] ellipsis_0.3.1
> [35] MASS_7.3-51.6
> [36] splines_4.0.2
> [37] insight_0.11.0
> [38] assertthat_0.2.1
> [39] colorspace_1.4-1
> [40] numDeriv_2016.8-1.1
> [41] labeling_0.3
> [42] utf8_1.1.4
> [43] munsell_0.5.0
> [44] crayon_1.3.4
>
>
>
>
> Sincerely,
>
>
>
> Jad Moawad
>
>
> PhD candidate and teaching assistant
> University of Lausanne  - NCCR Lives
> Institut des Sciences Sociales
> Bâtiment Geopolis - 5621
> 1015 Lausanne
> Switzerland
>
>
>

_______________________________________________
R-sig-mixed-models using r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


