[R-pkg-devel] Unused data is silently kept in the environment of a function

Duncan Murdoch murdoch@dunc@n @end|ng |rom gm@||@com
Fri Jul 8 17:01:23 CEST 2022


I accidentally replied privately to this message.  Here is the reply 
that I intended to send to the list, along with an addition based on 
Samuel's reply to me.

On 08/07/2022 9:50 a.m., Samuel Granjeaud wrote:
 > > Dear all,
 > >
 > > I want to compute processing functions to apply to the data.
 > > I apply the functions to the data in a second step.
 > > proc_0 increases the memory, proc_1 is safe.
 > > reprex below.
 > >
 > > If this behavior is known, could you tell me a workaround before I try
 > > to guess the best one?

When a function is called, it creates an environment that holds the
arguments and all local variables.  If the function returns that
environment, or a value that references it, all the local variables will
still be there.
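
A minimal sketch of that effect, using hypothetical names (make_getter,
big, label) that are not part of the original code:

```r
# Hypothetical example: the returned closure keeps the whole call environment.
make_getter <- function() {
  big <- numeric(1e6)   # local variable, roughly 8 MB
  label <- "small"
  function() label      # only needs label, but references the whole environment
}

g <- make_getter()
env <- environment(g)   # evaluation environment of the make_getter() call
ls(env)                 # both locals are still there
length(get("big", envir = env))
```

Although g() only ever uses label, big stays alive for as long as g does.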

In your function I believe the anonymous functions you create in `model`
are capturing the environment.  Since those functions are created as part
of the evaluation of proc_0, each of them will have the evaluation
environment of proc_0 attached.

NEW addition:  In R, every function has an associated environment, which
becomes the parent of the evaluation environment mentioned above.  It is
called "the environment of the function", and can be retrieved from a
function fn using `environment(fn)`.  For top-level functions like proc_0,
environment(proc_0) would be the global environment, but for functions
created within another function, it would be the evaluation environment
active at the time of creation.
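
As a small illustration (outer_fn and inner_fn are made-up names), the two
relationships can be checked directly:

```r
# Hypothetical functions, used only to inspect the environments involved.
outer_fn <- function() {
  inner_fn <- function() NULL
  # environment() with no argument returns the current evaluation environment
  list(inner = inner_fn, eval_env = environment())
}

res <- outer_fn()

# The environment of inner_fn is the evaluation environment of the outer_fn() call
identical(environment(res$inner), res$eval_env)

# The parent of that evaluation environment is the environment of outer_fn
identical(parent.env(res$eval_env), environment(outer_fn))

# When outer_fn is defined at top level, this names the global environment
environmentName(environment(outer_fn))
```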

Your code has

   sapply(cofactors, function(cofactor) function(z) z / cofactor)

This creates the function with definition

   function(cofactor) function(z) z / cofactor

The environment of that function will be the evaluation environment of 
proc_0.  When that function is called by sapply(), it will create an 
evaluation environment holding cofactor, and that environment will be 
used by the function returned, i.e. the result of

   function(z) z / cofactor

So you'll end up with this chain of environments:

   environment(function(z) z / cofactor) is the evaluation environment
   of function(cofactor) function(z) z / cofactor;

   its parent is the evaluation environment of proc_0, containing dat;

   its parent is environment(proc_0), which is the global environment.

The global environment isn't captured, but the others are, so you save a 
copy of dat every time you call proc_0.
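
The chain can be walked with parent.env().  This is a simplified stand-in
for proc_0 (proc_sketch is a hypothetical name, and it takes a plain
numeric vector instead of an "fb" object), not your actual code:

```r
# Simplified, hypothetical version of proc_0 working on a plain vector.
proc_sketch <- function(d) {
  dat <- sample(d)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  sapply(cofactors, function(cofactor) function(z) z / cofactor)
}

model <- proc_sketch(rnorm(1000))

e1 <- environment(model[[1]])  # evaluation env of function(cofactor) ...
e2 <- parent.env(e1)           # evaluation env of the proc_sketch() call
e3 <- parent.env(e2)           # environment(proc_sketch)

ls(e1)                         # holds cofactor
ls(e2)                         # holds cofactors, d and dat
identical(e3, environment(proc_sketch))
```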

But none of those functions need access to dat, so there's no need to 
keep it, and after your last use of it in proc_0, just run rm(dat) to 
get rid of it.
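
A sketch of that workaround, again on a plain vector rather than your
class (proc_rm is a hypothetical name):

```r
# Hypothetical variant of proc_0 that drops dat before returning the closures.
proc_rm <- function(d) {
  dat <- sample(d)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  rm(dat)  # dat is not used after this point
  sapply(cofactors, function(cofactor) function(z) z / cofactor)
}

x <- rnorm(1000)
model <- proc_rm(x)

# The captured evaluation environment no longer contains dat
ls(parent.env(environment(model[[1]])))

# The closures still work, since they only need cofactor
model[[2]](1)
```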

OLD part again:

By the way, mem_used() isn't a great way to measure memory use, because
it will count things that will be cleaned up in a future garbage
collection.  When I added "rm(dat)" to your function, I saw this:

  > a = new("fb")
  > a@d = sample(rnorm(1e7))
  > a@f = list()
  > mem_used()
363 MB
  > b = proc_0(a)
  > mem_used()
283 MB

i.e. *less* memory was used after b was created, presumably because a gc
happened.

It's better to use object.size() or pryr::object_size() to measure the
size of individual objects.  Neither one is perfect: they use different
rules to decide what to include, and in some cases, memory used in one
object is counted again as part of another.  The way R allocates memory
means there is *no* perfect definition of the size of an object.
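
As one example of those differing rules, the help page for object.size()
says the environment of a function is not included in the calculation.  A
small sketch with hypothetical names (make_closure, big):

```r
# Hypothetical closure that captures a large local vector.
make_closure <- function() {
  big <- numeric(1e6)   # roughly 8 MB
  function() NULL
}

f <- make_closure()

# Small: object.size() does not count the captured environment
object.size(f)

# The captured vector has to be measured separately
object.size(get("big", envir = environment(f)))
```

So a small object.size() result for a closure does not show that the
closure is cheap to keep around.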

Duncan Murdoch


 > >
 > > Best,
 > > Samuel
 > >
 > > ``` r
 > > # for memory tracking
 > > library(pryr)
 > >
 > > # a class
 > > setClass(
 > >     "fb",
 > >     slots = list(d = "numeric", f = "list"),
 > >     prototype=list(d = NULL, f = NULL)
 > > )
 > >
 > > # memory increased: keep dat somewhere and link it back to the returned value
 > > proc_0 <- function(x) {
 > >     dat = sample(x@d)
 > >     cofactors = c(mean(dat), median(dat), IQR(dat))
 > >     model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
 > >     x@f = list(model)
 > >     x
 > > }
 > >
 > > # init data
 > > mem_used()
 > > #> 47 MB
 > > a = new("fb")
 > > a@d = sample(rnorm(1e7))
 > > a@f = list()
 > > mem_used()
 > > #> 127 MB
 > > # memory increased of 80 MB
 > > # process
 > > b = proc_0(a)
 > > mem_used()
 > > #> 207 MB
 > > # memory increased of 80 MB again
 > > rm(a)
 > > mem_used()
 > > #> 207 MB
 > > # memory didn't decreased
 > > b@d = b@d + 1
 > > mem_used()
 > > #> 287 MB
 > > # memory increased
 > > # b@d was really pointing to a@d before increment
 > > sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
 > > #> [1] "cofactor" "cofactor" "cofactor"
 > > sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
 > > #> [1] -0.0003085559  0.0001107148  1.3485980291
 > > # environments look fine
 > > rm(b)
 > > mem_used()
 > > #> 47.5 MB
 > > # memory released back
 > >
 > >
 > > # memory safe
 > > proc_1 <- function(x) {
 > >     cofactors = c(mean(x@d), median(x@d), IQR(x@d))
 > >     model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
 > >     x@f = list(model)
 > >     x
 > > }
 > >
 > > # init data
 > > mem_used()
 > > #> 47.5 MB
 > > a = new("fb")
 > > a@d = sample(rnorm(1e7))
 > > a@f = list()
 > > mem_used()
 > > #> 128 MB
 > > b = proc_1(a)
 > > mem_used()
 > > #> 128 MB
 > > # memory didn't increased; b@d points to a@d; functions weight a few KB
 > > rm(a)
 > > mem_used()
 > > #> 128 MB
 > > sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
 > > #> [1] "cofactor" "cofactor" "cofactor"
 > > sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
 > > #> [1] -0.0003133312 -0.0002510665  1.3491459433
 > >
 > > rm(b)
 > > mem_used()
 > > #> 47.5 MB
 > >
 > > ```
 > >
 > > <sup>Created on 2022-07-08 by the [reprex
 > > package](https://reprex.tidyverse.org) (v2.0.1)</sup>
 > >
 > > <details style="margin-bottom:10px;">
 > > <summary>
 > > Session info
 > > </summary>
 > >
 > > ``` r
 > > sessionInfo()
 > > #> R version 4.2.1 (2022-06-23 ucrt)
 > > #> Platform: x86_64-w64-mingw32/x64 (64-bit)
 > > #> Running under: Windows 10 x64 (build 19044)
 > > #>
 > > #> Matrix products: default
 > > #>
 > > #> locale:
 > > #> [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
 > > #> [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
 > > #> [5] LC_TIME=French_France.utf8
 > > #>
 > > #> attached base packages:
 > > #> [1] stats     graphics  grDevices utils     datasets methods   base
 > > #>
 > > #> other attached packages:
 > > #> [1] pryr_0.1.5
 > > #>
 > > #> loaded via a namespace (and not attached):
 > > #>  [1] Rcpp_1.0.8.3     codetools_0.2-18 digest_0.6.29 withr_2.5.0
 > > #>  [5] magrittr_2.0.3   reprex_2.0.1     evaluate_0.15 highr_0.9
 > > #>  [9] stringi_1.7.6    rlang_1.0.3      cli_3.3.0 rstudioapi_0.13
 > > #> [13] fs_1.5.2         lobstr_1.1.2     rmarkdown_2.14 tools_4.2.1
 > > #> [17] stringr_1.4.0    glue_1.6.2       xfun_0.31 yaml_2.3.5
 > > #> [21] fastmap_1.1.0    compiler_4.2.1   htmltools_0.5.2 knitr_1.39
 > > ```
 > >
 > > </details>
 > >


