[Rd] mclapply memory leak?
simon.urbanek at r-project.org
Thu Sep 3 23:27:12 CEST 2015
> On Sep 2, 2015, at 1:12 PM, Toby Hocking <tdhock5 at gmail.com> wrote:
> Dear R-devel,
> I am running mclapply with many iterations over a function that modifies
> nothing and makes no copies of anything. It is taking up a lot of memory,
> so it seems to me like this is a bug. Should I post this to
> A minimal reproducible example can be obtained by first starting a memory
> monitoring program such as htop, and then executing the following code
> while looking at how much memory is being used by the system
> library(parallel)
> seconds <- 5
> N <- 100000
> result.list <- mclapply(1:N, function(i) Sys.sleep(1/N*seconds))
> On my system, memory usage goes up about 60MB on this example. But it does
> not go up at all if I change mclapply to lapply. Is this a bug?
> For a more detailed discussion with a figure that shows that the memory
> overhead is linear in N, please see
I'm not quite sure what is supposed to be the issue here. One would expect the memory used to be linear in the number of elements you process - by definition of the task, since you'll be creating linearly many more objects.
Also, using top doesn't actually measure the memory used by R itself (see R FAQ 7.42).
That said, I re-ran your script and the results didn't look anything like what you have on your webpage. For the NULL result, what you end up measuring is all the objects you create in your test, which overshadow any other memory usage; the total stabilizes after garbage collection. As you would expect, any output of top is essentially bogus until a gc has run. How much memory R will use is essentially governed by the level at which you set the gc trigger. In the real world you actually want that to be fairly high if you can afford it (in gigabytes, not megabytes), because you often get much higher performance by delaying gcs when total memory is not scarce (essentially using the memory as a buffer). Given that the usage here is so negligible, it won't trigger any gc on its own, so you're just measuring accumulated objects - which will always be higher for mclapply because of the bookkeeping and serialization involved in the communication.
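To make the point about measuring after a gc concrete, here is a minimal sketch that uses R's own gc() statistics instead of top. It assumes a Unix-alike (mclapply forks), and the smaller N and seconds values and the mc.cores = 2 setting are my choices to keep the run short, not the values from the original test.

```r
library(parallel)

seconds <- 1
N <- 1000

# Collect first and reset the "max used" statistics so the peak reported
# below covers only the mclapply call.
invisible(gc(reset = TRUE))

result.list <- mclapply(1:N, function(i) Sys.sleep(seconds / N), mc.cores = 2)

# gc() runs a collection and returns a matrix with rows Ncells and Vcells.
# Column 2 is the Mb in use after the collection; the last column is the
# peak Mb since the reset.
mem <- gc()
print(mem[, c(2, ncol(mem))])
```

The first column printed is what R still holds once garbage has been collected; the second is the transient peak, which is closer to what an external monitor sees before a collection runs.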
The real difference is only in the df case. The reason for it is that your lapply() there is simply a no-op, because R is smart enough to realize that you are always returning the same object, so it won't actually create anything and just returns a reference back to df - thus using no additional memory at all. However, once you split the inputs, your main session can no longer perform this optimization because the processing now happens in a separate process, so it has no way of knowing that you are returning the object unmodified. So what you are measuring is a special case that is arguably not really relevant in real applications.
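A rough way to see the df effect for yourself is sketched below. The data frame contents, n, the used_mb() helper and mc.cores = 2 are all made up for illustration (this is not the code from the webpage), and it again assumes a Unix-alike where mclapply actually forks.

```r
library(parallel)

# Hypothetical stand-in for the data frame in the test; contents are invented.
df <- data.frame(x = rnorm(1e5))   # about 0.8 Mb of doubles

# Mb in use after a full collection (Ncells + Vcells, second column of gc()).
used_mb <- function() sum(gc()[, 2])

n <- 50
before <- used_mb()
res.lapply <- lapply(1:n, function(i) df)    # n references to one object
after.lapply <- used_mb()
res.mclapply <- mclapply(1:n, function(i) df, mc.cores = 2)  # n unserialized copies
after.mclapply <- used_mb()

cat("lapply growth (Mb):  ", after.lapply - before, "\n")
cat("mclapply growth (Mb):", after.mclapply - after.lapply, "\n")
```

The lapply growth should be close to zero, while the mclapply growth should be roughly n times the size of df, because each result is serialized in the child and unserialized as a separate object in the parent.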
> R version 3.2.2 (2015-08-14)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu precise (12.04.5 LTS)
>  LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>  LC_TIME=en_US.UTF-8 LC_COLLATE=en_CA.UTF-8
>  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_CA.UTF-8
>  LC_PAPER=en_US.UTF-8 LC_NAME=C
>  LC_ADDRESS=C LC_TELEPHONE=C
>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  parallel graphics utils datasets stats grDevices methods
>  base
> other attached packages:
>  ggplot2_1.0.1 RColorBrewer_1.0-5 lattice_0.20-33
> loaded via a namespace (and not attached):
>  Rcpp_0.11.6 digest_0.6.4 MASS_7.3-43
>  grid_3.2.2 plyr_1.8.1 gtable_0.1.2
>  scales_0.2.3 reshape2_1.2.2 proto_1.0.0
>  labeling_0.2 tools_3.2.2 stringr_0.6.2
>  dichromat_2.0-0 munsell_0.4.2 PeakSegJoint_2015.08.06
>  compiler_3.2.2 colorspace_1.2-4