[R-sig-hpc] Seeing some memory leak with foreach...

Simon Urbanek simon.urbanek at r-project.org
Tue Feb 26 22:18:38 CET 2013


On Feb 26, 2013, at 9:49 AM, Jonathan Greenberg wrote:

> r-sig-geo'ers:
> 
> I always hate doing this, but the test function/dataset is going to be
> hard to pass along to the list.  Basically: I have a foreach call that
> uses no superassignments or strange environment manipulations, but the
> nodes show a slow, steady memory creep over time.  I was using a
> parallel backend for foreach via doParallel.  Has anyone else seen
> this behavior (unexplained memory creep)?  Is there a good way to
> "flush" a node?  I've tried embedding gc() at the top of my foreach
> function, but the process took about 24 hours to reach the
> memory-overuse stage (many iterations would have passed, i.e. the
> function would have been called more than once on a single node), so
> I'm not sure whether this will work; I figured I'd ask the group in
> the meantime.  I've seen other people post about this on various
> boards with no clear response/solution (gc() apparently didn't help).
> 
> Some other notes: there should be no accumulating output data, because
> the results are written to disk from within the foreach function (i.e.
> the value returned by the function that foreach executes is NULL).
> 
> I'll see if I can work up a faster-executing example later, but I
> wanted to see if there are some general pointers for dealing with
> memory leaks in a parallel system.
> 

Just some general technical notes on memory management:

a) R is pretty good at releasing all objects on gc() - in my experience that is typically not the problem. If you use third-party packages with native code, especially ones accessing external libraries, a memory leak in such a package is a more likely culprit. The second thing to be aware of is environments holding on to objects you'd rather they didn't. This can be the global workspace or environments stashed away inside other objects (model objects, for example, typically contain environments referencing the data).
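As a small sketch of that environment issue (the function and names here are made up for illustration): a closure can keep a large object alive long after you would expect it to be gone.

make_scaler <- function() {
    big <- rnorm(1e7)        # ~80 MB of doubles
    mu <- mean(big)
    function(x) x - mu       # the returned closure's environment still
}                            # holds 'big', even though only 'mu' is used

f <- make_scaler()
exists("big", envir = environment(f))   # TRUE - 'big' is still reachable
rm("big", envir = environment(f))       # drop the reference explicitly
gc()                                    # now those ~80 MB can be collected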

b) Note that gc() alone is of little use unless you make sure unused objects are actually out of scope. If you run gc() in the middle of a function (or at the end), it will only clear out temporary objects, not objects that were assigned locally but are no longer needed. To do this right, you may want to split the heavy lifting into chunks run in a local environment: once that environment goes out of scope, you retain only the intermediate result and can run gc(). This matters even without an explicit gc() call, because implicit GC runs happen anyway.
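To make (b) concrete, something along these lines - the loop body and file names are placeholders assuming doParallel as the backend, not Jonathan's actual code:

library(doParallel)
registerDoParallel(cores = 4)

invisible(foreach(i = 1:100) %dopar% {
    # do the heavy lifting inside local() so the intermediates go out
    # of scope as soon as the block returns
    out <- local({
        big <- replicate(10, rnorm(1e6))   # placeholder heavy work
        colMeans(big)                      # only this small result escapes
    })
    saveRDS(out, sprintf("chunk_%03d.rds", i))  # side-effect output, as in the OP
    gc()     # 'big' is unreachable here, so the collector can reclaim it
    NULL     # return nothing, matching the original setup
})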

c) Even when R releases memory, the OS is often unable to claim it back. In theory this is no big problem, since the memory gets re-used later, but it can become one if overall memory usage is high or if the memory becomes heavily fragmented for some reason (again, that should not happen for a function call without side effects, but beware of the side effects).
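A related diagnostic: log R's own view of memory per worker and compare it against what the OS reports (RSS in top). If gc()'s numbers stay flat while the resident size keeps growing, you are looking at fragmentation or a leak in native code rather than R objects piling up. A sketch, using the same hypothetical setup as above:

mem_log <- foreach(i = 1:100, .combine = rbind) %dopar% {
    # ... the real work would go here ...
    g <- gc()                  # returns a matrix of memory statistics
    c(iter = i,
      pid = Sys.getpid(),      # identifies the worker process
      used_mb = sum(g[, 2]))   # column 2 is the "(Mb)" currently in use
}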

Cheers,
Simon


> --j
> 
> -- 
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
> Department of Geography and Geographic Information Science
> University of Illinois at Urbana-Champaign
> 607 South Mathews Avenue, MC 150
> Urbana, IL 61801
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
> 


