[R-sig-ME] Question regarding large data.frame in LMER?

Jad Moawad jad.moawad using unil.ch
Mon Dec 14 17:00:41 CET 2020


Thanks a lot, everyone, for all the suggestions you have provided; I really appreciate it. Below I reply to some of the comments and describe what has worked so far.



1)    If I understood the comment about duplicates correctly: there was already no id that appears with the same number across different countries and years.

2)    I converted the data from a tibble to a plain data.frame with as.data.frame().

3)    I now use poly(agecent, degree=2, raw=TRUE) instead of I(age^2).

4)    I tried centering, scaling and/or standardizing my variables, but this has not solved the issue.

5)    Regarding the question about how many country_year levels I have: the observations (1,150,110) are nested in *both* individuals (472,604) and country-years (180). In other words, they are cross-classified. In turn, individuals and country-years are *both* nested in countries (30). So the data structure is like a diamond, with one point (observations) at the bottom, another point (countries) at the top, and the other two levels in between.

6)    I tried glmmTMB on a binary outcome, but I received an error that directed me to this page: https://cran.r-project.org/web/packages/glmmTMB/vignettes/troubleshooting.html
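
On point 3, a small base-R sketch may help show what the switch does. With raw=TRUE, the second column of poly() is simply the square of its input, so any change in behaviour comes from using the centered variable; the default orthogonal form (raw=FALSE) is the one that produces uncorrelated linear and quadratic columns. The simulated ages and the way agecent is built here are assumptions for illustration, not the actual data.

```r
set.seed(3)
# Hypothetical ages; 'agecent' is assumed to be age minus its mean
age     <- round(runif(1000, 18, 80))
agecent <- age - mean(age)

raw_p  <- poly(agecent, degree = 2, raw = TRUE)  # columns: x, x^2
orth_p <- poly(agecent, degree = 2)              # orthogonal polynomials

# With raw = TRUE the quadratic column equals agecent^2
same_as_square <- isTRUE(all.equal(as.numeric(raw_p[, 2]), agecent^2))

# Correlation between the linear and quadratic columns
cor_uncentered <- cor(age, age^2)                 # high for raw, uncentered age
cor_raw        <- cor(raw_p[, 1], raw_p[, 2])     # reduced by centering
cor_orth       <- cor(orth_p[, 1], orth_p[, 2])   # zero up to rounding

print(c(uncentered = cor_uncentered, raw = cor_raw, orthogonal = cor_orth))
```

Whether the orthogonal form helps with the convergence problem in this thread is untested here; it only removes the collinearity between the two age terms.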



In the meantime, I received a comment from a statistics professor suggesting that I change the design of my model. He suggested observation-years (level 1) nested within individuals (level 2) nested within countries (level 3). This model seemed to work fine.
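
For readers following along, here is a minimal sketch of what such a three-level nested specification can look like in lme4. It runs on simulated data with random intercepts only; the variable names mirror the ones used in this thread, but the data, effect sizes and the omission of random slopes are assumptions, and this is not necessarily the exact model that was fitted.

```r
library(lme4)

set.seed(2)
# Simulated stand-in: 10 countries, 50 individuals each, 3 waves per individual
n_country <- 10; n_per <- 50; n_wave <- 3
ids <- seq_len(n_country * n_per)
d <- data.frame(id = factor(rep(ids, each = n_wave)))
d$country <- factor((as.integer(d$id) - 1) %/% n_per + 1)
d$age <- rnorm(nrow(d))
u_c <- rnorm(n_country, sd = 0.5)    # country intercepts
u_i <- rnorm(length(ids), sd = 1)    # individual intercepts
d$health <- 1 + 0.3 * d$age + u_c[as.integer(d$country)] +
  u_i[as.integer(d$id)] + rnorm(nrow(d))

# Observations within individuals within countries
fit_nested <- lmer(health ~ age + (1 | country/id), data = d)

# Because every id occurs in exactly one country, this is the same model
fit_crossed <- lmer(health ~ age + (1 | country) + (1 | id), data = d)

print(c(nested = as.numeric(logLik(fit_nested)),
        crossed = as.numeric(logLik(fit_crossed))))
```

The (1 | country/id) term expands to (1 | country) + (1 | country:id), which is why the two fits agree when ids are unique across countries. Note that this design has no country_year level at all.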



In short, I am still not sure what the main issue is. I still have to try the methods suggested by Harold and Cesko. However, I am still not sure how to use buildmer::re2mgcv. If I don't succeed with these methods, I will stick to the new design.



All the best,



Jad




________________________________
From: Voeten, C.C. <c.c.voeten using hum.leidenuniv.nl>
Sent: Friday, December 11, 2020 11:46:17
To: Harold Doran; Jad Moawad; r-sig-mixed-models using r-project.org
Subject: RE: Question regarding large data.frame in LMER?

Another option could be to try fitting your model using mgcv::bam, which is optimized for 'big' datasets. The function is primarily intended for GAMMs, but an LMM is just a specific type of GAMM, so this is no problem. You could use buildmer::re2mgcv to convert your lme4 random-effects specification (i.e. your formula using | terms) to the equivalent mgcv specification (using s() terms).

However, note that the bam model will not be completely equivalent to the lme4 model, in that the bam model will not model the correlations between the random effects. If those are important to you, then please disregard my suggestion!
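
As a rough illustration of the correspondence described above, the sketch below writes the mgcv terms by hand for a random intercept and a random slope, so it does not rely on the exact interface of buildmer::re2mgcv (check ?re2mgcv before using it). The data are simulated and the variable names are assumptions borrowed from the thread.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data with a random intercept and age slope per country
n <- 5000
d <- data.frame(country = factor(sample(20, n, replace = TRUE)),
                age     = rnorm(n))
b0 <- rnorm(20, sd = 0.5)   # country intercept deviations
b1 <- rnorm(20, sd = 0.2)   # country slope deviations
k  <- as.integer(d$country)
d$health <- 1 + b0[k] + (0.3 + b1[k]) * d$age + rnorm(n)

# lme4 form:  health ~ age + (1 + age | country)
# mgcv form:  one "re" smooth per random-effect term; the intercept/slope
#             correlation is NOT estimated, as noted in the message above
fit <- bam(health ~ age +
             s(country, bs = "re") +        # random intercepts
             s(country, age, bs = "re"),    # random slopes for age
           data = d, method = "fREML", discrete = TRUE)

print(summary(fit))
```

The discrete = TRUE option is one of the features that makes bam practical on large data; it can be dropped if exact (non-discretized) covariates are preferred.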

Cesko

-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Harold Doran
Sent: Friday, December 11, 2020 11:38 AM
To: Jad Moawad <jad.moawad using unil.ch>; r-sig-mixed-models using r-project.org
Subject: Re: [R-sig-ME] Question regarding large data.frame in LMER?

Assuming that you're sampling from your complete data set in a way that represents the complete data, one strategy might also be to use starting values from prior converged models and incrementally increase the size of the data.

For example,

1) run model with 10% of data and get parameter estimates
2) use the param estimates from (1) as starting values and now increase size of data to 40%
3) repeat

The strategy doesn't solve the p.d. (positive-definiteness) issue, but it does improve the chances of reaching the top of the hill faster with a big data file.

It's an incremental EM idea that reduces the amount of work lmer() (or any iterative maximization procedure) would need to do with a very large file. In other words, why start all over again with a very big file when we can start somewhere better and let the algorithm start closer to the top of the hill, so to speak.
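
A minimal sketch of steps 1) and 2), assuming lme4 and simulated data: the covariance parameters (theta) from the small fit are passed to lmer() through its start argument. The data, the simple random-intercept formula and the choice to subsample whole individuals are assumptions for illustration only.

```r
library(lme4)

set.seed(1)
# Simulated stand-in: 2000 individuals, 3 observations each, 20 countries
n_id <- 2000
d <- data.frame(id = factor(rep(seq_len(n_id), each = 3)))
d$country <- factor((as.integer(d$id) - 1) %% 20 + 1)
d$age <- rnorm(nrow(d))
u_id <- rnorm(n_id, sd = 1)
u_c  <- rnorm(20, sd = 0.5)
d$health <- 1 + 0.3 * d$age + u_id[as.integer(d$id)] +
  u_c[as.integer(d$country)] + rnorm(nrow(d))

form <- health ~ age + (1 | country) + (1 | id)

# Step 1: fit on roughly 10% of the individuals
sub_ids   <- sample(levels(d$id), size = 0.1 * nlevels(d$id))
fit_small <- lmer(form, data = droplevels(d[d$id %in% sub_ids, ]))

# Step 2: reuse the estimated theta as starting values on the larger data
theta0   <- getME(fit_small, "theta")
fit_full <- lmer(form, data = d, start = list(theta = theta0))

print(theta0)
print(getME(fit_full, "theta"))
```

One caveat: lmer orders theta by the grouping factors' number of levels, so it is worth checking names(theta0) against the larger fit when a subsample could change that ordering.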

Hope it helps.
Harold



-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Jad Moawad
Sent: Thursday, December 10, 2020 7:12 AM
To: r-sig-mixed-models using r-project.org
Subject: [R-sig-ME] Question regarding large data.frame in LMER?



I am working with a large data.frame that contains around 1.4 million observations. Initially, when I was running my models, I worked on a sub-sample (10% of my full sample), because running one model can take a lot of time using the original data. Once I was sure that all variables were well harmonized and all regressions were running fine, I ran my models using the full sample. However, the regressions did not converge and I received the following two errors from two different models:

Error in fun(xaa, ...) : Downdated VtV is not positive definite

Error in fun(xss, ...) : Downdated VtV is not positive definite

I use the lmer function to fit my models and I include random slopes at the country and country_year levels. Below is the code that I use.

Model1 <- lmer(health~ class + age + I(age^2)  + class*macro_unemployment +
               (class + age + I(age^2)|country) +
               (class+ age + I(age^2) |country_year) +
               (1|id), data=df)

Model2 <- lmer(health~ education + age + I(age^2)  + education*macro_unemployment+
               (education + age + I(age^2)|country) +
               (education + age + I(age^2) |country_year) +
               (1|id), data=df)


Could someone help me please with solving this issue?

Below you find a glimpse (str) of my data and my sessionInfo():

tibble [1,370,264 x 8] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ health            : num [1:1370264] 100 100 50 100 0 75 75 100 100 50 ...
 $ class             : Factor w/ 3 levels "Upper-middle class",..: 3 3 NA 3 3 3 3 1 1 3 ...
 $ education         : Factor w/ 3 levels "low","mid","high": 1 1 1 1 1 1 2 3 3 1 ...
 $ age               : num [1:1370264] 24 25 24 25 42 43 34 34 35 58 ...
 $ macro_unemployment: num [1:1370264] 5.24 4.86 5.24 4.86 5.24 ...
 $ id                : int [1:1370264] 2 2 3 3 4 4 6 7 7 8 ...
 $ country_year      : int [1:1370264] 1 2 1 2 1 2 1 1 2 1 ...
 $ country           : Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "groups")= tibble [27 x 2] (S3: tbl_df/tbl/data.frame)
  ..$ country: Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 6 7 8 9 10 11 12 ...
  ..$ .rows  : list<int> [1:27]
  .. ..$ : int [1:47204] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:41361] 47205 47206 47207 47208 47209 47210 47211 47212 47213 47214 ...
  .. ..$ : int [1:42407] 88566 88567 88568 88569 88570 88571 88572 88573 88574 88575 ...
  .. ..$ : int [1:48253] 130973 130974 130975 130976 130977 130978 130979 130980 130981 130982 ...
  .. ..$ : int [1:31917] 179226 179227 179228 179229 179230 179231 179232 179233 179234 179235 ...
  .. ..$ : int [1:44047] 211143 211144 211145 211146 211147 211148 211149 211150 211151 211152 ...
  .. ..$ : int [1:62087] 255190 255191 255192 255193 255194 255195 255196 255197 255198 255199 ...
  .. ..$ : int [1:94309] 317277 317278 317279 317280 317281 317282 317283 317284 317285 317286 ...
  .. ..$ : int [1:37246] 411586 411587 411588 411589 411590 411591 411592 411593 411594 411595 ...
  .. ..$ : int [1:77253] 448832 448833 448834 448835 448836 448837 448838 448839 448840 448841 ...
  .. ..$ : int [1:16823] 526085 526086 526087 526088 526089 526090 526091 526092 526093 526094 ...
  .. ..$ : int [1:24687] 542908 542909 542910 542911 542912 542913 542914 542915 542916 542917 ...
  .. ..$ : int [1:116263] 567595 567596 567597 567598 567599 567600 567601 567602 567603 567604 ...
  .. ..$ : int [1:43218] 683858 683859 683860 683861 683862 683863 683864 683865 683866 683867 ...
  .. ..$ : int [1:28709] 727076 727077 727078 727079 727080 727081 727082 727083 727084 727085 ...
  .. ..$ : int [1:27583] 755785 755786 755787 755788 755789 755790 755791 755792 755793 755794 ...
  .. ..$ : int [1:77960] 783368 783369 783370 783371 783372 783373 783374 783375 783376 783377 ...
  .. ..$ : int [1:36922] 861328 861329 861330 861331 861332 861333 861334 861335 861336 861337 ...
  .. ..$ : int [1:93194] 898250 898251 898252 898253 898254 898255 898256 898257 898258 898259 ...
  .. ..$ : int [1:9004] 991444 991445 991446 991447 991448 991449 991450 991451 991452 991453 ...
  .. ..$ : int [1:40074] 1000448 1000449 1000450 1000451 1000452 1000453 1000454 1000455 1000456 1000457 ...
  .. ..$ : int [1:29342] 1040522 1040523 1040524 1040525 1040526 1040527 1040528 1040529 1040530 1040531 ...
  .. ..$ : int [1:85124] 1069864 1069865 1069866 1069867 1069868 1069869 1069870 1069871 1069872 1069873 ...
  .. ..$ : int [1:92350] 1154988 1154989 1154990 1154991 1154992 1154993 1154994 1154995 1154996 1154997 ...
  .. ..$ : int [1:50188] 1247338 1247339 1247340 1247341 1247342 1247343 1247344 1247345 1247346 1247347 ...
  .. ..$ : int [1:7598] 1297526 1297527 1297528 1297529 1297530 1297531 1297532 1297533 1297534 1297535 ...
  .. ..$ : int [1:65141] 1305124 1305125 1305126 1305127 1305128 1305129 1305130 1305131 1305132 1305133 ...
  .. ..@ ptype: int(0)
  ..- attr(*, ".drop")= logi TRUE
>


Session Info:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices
[4] utils     datasets  methods
[7] base

other attached packages:
 [1] sessioninfo_1.1.1
 [2] sjlabelled_1.1.5
 [3] varhandle_2.0.5
 [4] labelled_2.7.0
 [5] dplyr_1.0.0
 [6] ggplot2_3.3.2
 [7] forcats_0.5.0
 [8] reprex_0.3.0
 [9] lmerTest_3.1-3
[10] lme4_1.1-25
[11] Matrix_1.2-18

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6
 [2] compiler_4.0.2
 [3] pillar_1.4.4
 [4] nloptr_1.2.2.1
 [5] tools_4.0.2
 [6] digest_0.6.25
 [7] boot_1.3-25
 [8] statmod_1.4.34
 [9] lifecycle_0.2.0
[10] tibble_3.0.1
[11] nlme_3.1-148
[12] gtable_0.3.0
[13] lattice_0.20-41
[14] pkgconfig_2.0.3
[15] rlang_0.4.7
[16] cli_2.0.2
[17] rstudioapi_0.11
[18] haven_2.3.1
[19] withr_2.2.0
[20] hms_0.5.3
[21] generics_0.0.2
[22] vctrs_0.3.1
[23] fs_1.4.1
[24] grid_4.0.2
[25] tidyselect_1.1.0
[26] glue_1.4.1
[27] R6_2.4.1
[28] fansi_0.4.1
[29] minqa_1.2.4
[30] farver_2.0.3
[31] purrr_0.3.4
[32] magrittr_1.5
[33] scales_1.1.1
[34] ellipsis_0.3.1
[35] MASS_7.3-51.6
[36] splines_4.0.2
[37] insight_0.11.0
[38] assertthat_0.2.1
[39] colorspace_1.4-1
[40] numDeriv_2016.8-1.1
[41] labeling_0.3
[42] utf8_1.1.4
[43] munsell_0.5.0
[44] crayon_1.3.4




Sincerely,



Jad Moawad


PhD candidate and teaching assistant
University of Lausanne  - NCCR Lives
Institut des Sciences Sociales
Bâtiment Geopolis - 5621
1015 Lausanne
Switzerland


        [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models using r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
