[Bioc-devel] Might be worth following thread on R mailing lists

Thu Oct 17 17:13:23 CEST 2019

Hi,
I spotted this thread on the R-help mailing list and thought it might be worth following, as it might affect Bioc-Tidyverse users.
Aedin

Message: 5
Date: Thu, 17 Oct 2019 10:28:45 +0100
From: Jocelyn Ireson-Paine <jocpaine using googlemail.com>
To: r-help using r-project.org
Subject: [R] Surprisingly large amount of memory used by tibble with
	lots of nested tibbles within
Message-ID:
	<CAOdhVrFhJpuzoXcqCYyjEMWGtU=hcW84N2xA43dgqTkX5vFxxQ using mail.gmail.com>

I'm using the Tidyverse group_nest() function to nest data about families
and people within households, and have found that this seems to use
astonishing quantities of memory. It's more than I'd expect from the number
of nested tibbles created. I'll outline what was happening with my actual
data, then show a reproducible example.

My data comes from the British Labour Force Survey. It's a flat file,
storable as CSV, representing households. Each record represents a person,
with variables such as age, sex, income, health, employment. Records have a
household ID, which groups them into households; and a family ID, grouping
into families within a household. A fair number of households have more
than one family; and many families have more than one person.

So the data has a hierarchical structure of people within families within
households. I need to process it at all three levels. Some benefit
calculations, for example, need doing per person; but I also need to
aggregate over families and households. It seemed obvious that using
group_nest() would make this easy. Indeed it does, but at the expense of
memory. Whereas one file of 4000 households and about 40 variables occupies
1.44 MB unnested, it blows up to 19.8 MB once double-nested so that people
are within families within households.

Also surprising is that the nesting makes saveRDS() much slower.
Saving the unnested data is almost instantaneous; saving the nested data
takes over 10 minutes. Reading it back is also slow.

I'll now show my reproducible example. For this, I created a 10,000-row
tibble with 15 data columns, all generated by runif() . I then added a
grouping column to indicate which rows could be regarded as in the same
group. This can be varied, so I can have every row in its own group, or all
rows in the same group, or somewhere in between. I then called group_nest()
on this and looked at the memory used by the result. Actually, I did this
inside a function, and called it with different numbers of groups, to see
how memory usage varied with number of nested tibbles. Each row's group ID
was created by remaindering (via %%) on its row number.
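
As a small illustration of the remaindering scheme just described (not part
of the original code; the values of len and ngroups here are chosen only to
keep the output short), the group IDs cycle through 0 .. ngroups-1 as the row
number increases, so the groups come out nearly equal in size:

```r
# Illustrative values only; the post uses len = 10000 and varies ngroups.
len     <- 10L
ngroups <- 3L

# Row number modulo ngroups: rows 3, 6, 9 land in group 0,
# rows 1, 4, 7, 10 in group 1, and rows 2, 5, 8 in group 2.
group_id <- seq_len(len) %% ngroups

print(group_id)
print(table(group_id))   # number of rows per group
```

With ngroups equal to len, every row gets a distinct ID, and with ngroups
equal to 1, every row gets ID 0.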

First, my source:

library( tidyverse )
library( pryr )
library( lobstr )
library( glue )
library( microbenchmark )
library( assertthat )

investigate_nesting_effect <- function( len, ngroups )
{
   t <- tibble( id=1:len
              , a=runif(len), b=runif(len), c=runif(len), d=runif(len), e=runif(len)
              , f=runif(len), g=runif(len), h=runif(len), i=runif(len), j=runif(len)
              , k=runif(len), l=runif(len), m=runif(len), n=runif(len), o=runif(len)
              )

   t $ group_id <- 1:len %% ngroups

   tg <- t %>% group_by( group_id )

   tgn <- tg %>% group_nest( keep=FALSE )

   tgnun <- tgn %>% unnest( data )

   assert_that( are_equal( t, tgnun, tol=0.001 ) )

   print( glue( "Length={len}, ngroups={ngroups}, nrow={nrow(tgn)}, mem={object_size( tgn )}" ) )

#  res <- microbenchmark( saveRDS( tgn, str_c( "data/tgn_", ngroups ) )
#                       , times=5
#                       )
#
#  print( res )

   tgn
}

for ( ngroups in c( 1, 3, 10, 30, 100, 300, 1000, 3000, 10000 ) ) {
   investigate_nesting_effect( 10000, ngroups )
}

In this, ngroups is the number of groups to create. The first time round
the loop, all rows end up nested within one tibble; the final time, each
row is nested within its own tibble. In the function, t is the original
unnested tibble; tg is the grouped version; tgn is the nested version; and
tgnun is the nested version unnested again. The assert_that() call
sanity-checks my code by asserting that nesting and then unnesting gives
back the original.

Now my results:
Length=10000, ngroups=1, nrow=1, mem=1,244,528
Length=10000, ngroups=3, nrow=3, mem=1,246,920
Length=10000, ngroups=10, nrow=10, mem=1,255,280
Length=10000, ngroups=30, nrow=30, mem=1,278,944
Length=10000, ngroups=100, nrow=100, mem=1,361,744
Length=10000, ngroups=300, nrow=300, mem=1,599,344
Length=10000, ngroups=1000, nrow=1000, mem=3,155,344
Length=10000, ngroups=3000, nrow=3000, mem=5,043,344
Length=10000, ngroups=10000, nrow=10000, mem=13,123,344

I've manually inserted commas into the memory figures to make them easier
to read. Note that nesting every one of those 10,000 rows into its own
tibble adds about 11.9 MB compared with the single-group case. That is
roughly 1.2 KB added per nested tibble, which seems a lot.
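
The per-tibble figure can be checked from the numbers printed above. The
sketch below redoes that arithmetic, then builds a rough base-R analogue: a
list of one-row data frames compared with the same data held in a single
data frame. The analogue is an assumption-laden illustration, not a
measurement of tibbles: it uses plain data frames and base object.size(),
which counts shared components differently from pryr's object_size(), so
its byte counts will not match the figures in the post.

```r
# Figures copied from the results above (bytes).
mem_one_group  <- 1244528    # ngroups = 1
mem_all_groups <- 13123344   # ngroups = 10000
n_nested       <- 10000

extra_bytes <- mem_all_groups - mem_one_group
per_tibble  <- extra_bytes / n_nested
print(extra_bytes)   # total extra bytes from nesting every row
print(per_tibble)    # approximate extra bytes per nested tibble

# Base-R analogue: each small data frame carries its own column vectors,
# names, row names and class attributes, however few rows it holds.
set.seed(1)
len   <- 1000
whole <- as.data.frame(matrix(runif(len * 15), nrow = len))

# One single-row data frame per original row.
pieces <- split(whole, seq_len(len))

whole_bytes  <- as.numeric(object.size(whole))
pieces_bytes <- as.numeric(object.size(pieces))

# Average extra bytes attributable to each one-row data frame.
print((pieces_bytes - whole_bytes) / len)
```

If the analogue shows a per-piece overhead of a similar order of magnitude,
that would suggest the cost is mostly fixed per-object bookkeeping rather
than anything specific to group_nest().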

The code also contains some commented-out timings on saveRDS() . These
didn't show the same time blow-up that I experienced with my household
data, so I still need to reproduce that. The slowdown is equally
annoying, as it means users have to wait so long for data to be loaded.
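
One way to time the save and load steps without extra packages is sketched
below, using only base R. The helper name time_save_and_load() and the small
test objects are hypothetical, introduced just for this illustration; the
nested object here is a plain list of data frames, not the output of
group_nest(), so it may well not show the slowdown seen with the household
data.

```r
# Hypothetical helper: time one saveRDS()/readRDS() round trip for an object.
time_save_and_load <- function(x) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))          # remove the temporary file on exit

  write_time <- system.time(saveRDS(x, path))[["elapsed"]]
  read_time  <- system.time(y <- readRDS(path))[["elapsed"]]

  # The object read back should match what was written.
  stopifnot(isTRUE(all.equal(x, y)))

  c(write = write_time, read = read_time, bytes = file.size(path))
}

set.seed(1)
flat   <- data.frame(id = 1:1000, a = runif(1000), b = runif(1000))
nested <- split(flat, flat$id %% 100)   # list of 100 small data frames

print(time_save_and_load(flat))
print(time_save_and_load(nested))
```

Swapping in the real unnested and nested tibbles for flat and nested would
give a like-for-like comparison of write time, read time and file size.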

Any thoughts would be welcome, including faults in my code above. For what
it's worth, here's my sessionInfo() :

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
  [1] scales_1.0.0        htmlTable_1.13.1    ggrepel_0.8.1
  [4] glue_1.3.1          magrittr_1.5        igraph_1.2.4.1
  [7] yaml_2.2.0          haven_2.1.1         pryr_0.1.4
[10] readxl_1.3.1        fs_1.3.1            memo_1.0.1
[13] Biobase_2.44.0      BiocGenerics_0.30.0 lubridate_1.7.4
[16] DT_0.7              shinyjs_1.0         shinyWidgets_0.4.8
[19] shiny_1.3.2         assertthat_0.2.1    forcats_0.4.0
[22] stringr_1.4.0       dplyr_0.8.3         purrr_0.3.2
[25] readr_1.3.1         tidyr_0.8.3         tibble_2.1.3
[28] ggplot2_3.2.0       tidyverse_1.2.1     conflicted_1.0.4
[31] BiocManager_1.30.4

loaded via a namespace (and not attached):
  [1] nlme_3.1-140      usethis_1.5.1     devtools_2.1.0    httr_1.4.0
  [5] rprojroot_1.3-2   tools_3.6.1       backports_1.1.4   R6_2.4.0
  [9] lazyeval_0.2.2    colorspace_1.4-1  withr_2.1.2       tidyselect_0.2.5
[13] prettyunits_1.0.2 processx_3.4.0    curl_3.3          compiler_3.6.1
[17] cli_1.1.0         rvest_0.3.4       xml2_1.2.0        desc_1.2.0
[21] checkmate_1.9.4   callr_3.3.0       digest_0.6.20     pkgconfig_2.0.2
[25] htmltools_0.3.6   sessioninfo_1.1.1 htmlwidgets_1.3   rlang_0.4.0
[29] rstudioapi_0.10   generics_0.0.2    jsonlite_1.6      crosstalk_1.0.0
[33] Rcpp_1.0.1        munsell_0.5.0     stringi_1.4.3     pkgbuild_1.0.3
[37] grid_3.6.1        promises_1.0.1    crayon_1.3.4      lattice_0.20-38
[41] hms_0.5.0         zeallot_0.1.0     knitr_1.23        ps_1.3.0
[45] pillar_1.4.2      codetools_0.2-16  pkgload_1.0.2     remotes_2.1.0
[49] modelr_0.1.4      vctrs_0.2.0       httpuv_1.5.1      testthat_2.1.1
[53] cellranger_1.1.0  gtable_0.3.0      xfun_0.8          mime_0.7
[57] xtable_1.8-4      broom_0.5.2       later_0.8.0       memoise_1.1.0

-Joc-

-- 
Aedin Culhane, PhD
---------------
Senior Research Scientist,

Department of Data Science,
Division of Biostatistics and Computational Biology,
Dana-Farber Cancer Institute, Mailstop LC9342,
450 Brookline Ave, Boston MA 02215

Department of Biostatistics,
Harvard TH Chan School of Public Health,
Boston MA 02215

Tel:     +1-617-632-2468
Email:   aedin using jimmy.harvard.edu
Twitter: @AedinCulhane
Our Department Twitter is @dfcidatascience

Office location:
LC9324 Longwood Center,
360 Longwood Avenue, DFCI, Boston.

Co-Founder, organizer of R/Bioconductor for Genomics Meetup
https://www.meetup.com/Boston-R-Bioconductor-for-genomics/
Leader of DF/HCC Bioinformatics and Computational Biology Navigator Service