[Rd] Moderating consequences of garbage collection when in C

dhinds at sonic.net dhinds at sonic.net
Mon Nov 14 20:47:35 CET 2011


dhinds at sonic.net wrote:
> Martin Morgan <mtmorgan at fhcrc.org> wrote:
> > Allocating many small objects triggers numerous garbage collections as R 
> > grows its memory, seriously degrading performance. The specific use case 
> > is in creating a STRSXP of several 1,000,000's of elements of 60-100 
> > characters each; a simplified illustration understating the effects 
> > (because there is initially little to garbage collect, in contrast to an 
> > R session with several packages loaded) is below.

> What a coincidence -- I was just going to post a question about why it
> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
> characters long.  I had noticed that this seemed to show much worse
> than linear scaling.  I had not thought of garbage collection as the
> culprit -- but indeed it is.  By manipulating the GC trigger, I can
> make this operation take as little as 3 seconds (with no GC) or as
> long as 76 seconds (with 31 garbage collections).

I had done some Google searches on this issue, since it seemed like it
should not be too uncommon, but the only other hit I could find was a
thread from 2006:

https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html

In any case, one issue with your suggested workaround is that it
requires knowing how much additional storage is needed, which may be
expensive to determine.  I've just tried implementing a different
approach, which is to define two new functions that disable and
re-enable GC.  The function that disables GC first invokes R_gc_full()
to shrink the heap as much as possible, then sets a flag.  Then in
R_gc_internal(), I first check that flag, and if it is set, I call
AdjustHeapSize(size_needed) and return immediately, without collecting.
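
To make the control flow concrete, here is a toy model of that idea.
It is a sketch only, not R's memory manager: every name in it
(toy_heap, toy_gc_internal, and so on) is hypothetical, all objects
are assumed to stay live, and the 20% growth step is an arbitrary
assumption.

```c
#include <stddef.h>

/* Toy model of a heap with a GC trigger.  All names are hypothetical;
   this is not R's actual memory manager. */
typedef struct {
    size_t used;        /* units currently allocated */
    size_t trigger;     /* a collection is requested past this point */
    int gc_disabled;    /* the flag set by the "disable GC" call */
    int collections;    /* number of collections performed so far */
} toy_heap;

/* Rough analogue of AdjustHeapSize(size_needed): raise the trigger
   until the request fits.  The 20% step is an arbitrary assumption. */
void toy_grow(toy_heap *h, size_t size_needed)
{
    while (h->used + size_needed > h->trigger)
        h->trigger += h->trigger / 5 + 1;
}

/* Rough analogue of the patched R_gc_internal(): when the flag is set,
   grow the heap and return without collecting. */
void toy_gc_internal(toy_heap *h, size_t size_needed)
{
    if (h->gc_disabled) {
        toy_grow(h, size_needed);
        return;
    }
    h->collections++;           /* pretend to mark and sweep */
    toy_grow(h, size_needed);   /* nothing was freed, so grow anyway */
}

void toy_alloc(toy_heap *h, size_t size)
{
    if (h->used + size > h->trigger)
        toy_gc_internal(h, size);
    h->used += size;
}

/* Disabling runs one full collection first, then sets the flag. */
void toy_gc_disable(toy_heap *h)
{
    h->collections++;
    h->gc_disabled = 1;
}

void toy_gc_enable(toy_heap *h)
{
    h->gc_disabled = 0;
}

/* Allocate n_allocs one-unit objects and report how many collections
   were triggered, with or without the bracket. */
int toy_count_collections(size_t n_allocs, int disable)
{
    toy_heap h = { 0, 1000, 0, 0 };
    if (disable)
        toy_gc_disable(&h);
    for (size_t i = 0; i < n_allocs; i++)
        toy_alloc(&h, 1);
    if (disable)
        toy_gc_enable(&h);
    return h.collections;
}
```

In this model the bracketed run performs only the single up-front
collection, while the unbracketed run collects every time the trigger
is reached.  The real cost per collection depends on how much live
data there is to scan, which the toy does not try to capture.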

These calls could be used to bracket any code section that expects to
make lots of calls to R's memory allocator.  The downside is that
every path out of such a code section (including error handling) must
take care to unset the GC-disabled flag.  I would want to hear from
someone on the R team about whether they think this is a good idea.
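
The exit-path concern can be sketched in plain C.  This is an
illustration only: the flag, toy_error() and fill_strings() are
hypothetical stand-ins, and setjmp/longjmp is used here merely to
imitate a non-local error exit; it is not how a package would hook
into R's own error handling.

```c
#include <setjmp.h>

/* Hypothetical GC-disabled flag and error target; stand-ins only. */
static int toy_flag_gc_disabled = 0;
static jmp_buf toy_error_target;

int toy_flag_is_set(void)
{
    return toy_flag_gc_disabled;
}

/* Stand-in for an error raised deep inside the bracketed code. */
void toy_error(void)
{
    longjmp(toy_error_target, 1);
}

/* Stand-in for the allocation-heavy work; fails on request. */
void fill_strings(int fail)
{
    if (fail)
        toy_error();
}

/* Run the work with the flag set.  Returns 0 on success and 1 if the
   work raised an error; the flag is cleared on both paths. */
int run_with_gc_disabled(int fail)
{
    toy_flag_gc_disabled = 1;
    if (setjmp(toy_error_target) == 0) {
        fill_strings(fail);
        toy_flag_gc_disabled = 0;   /* normal exit */
        return 0;
    }
    toy_flag_gc_disabled = 0;       /* error exit */
    return 1;
}
```

The point is that the reset has to appear on both branches.  If the
error branch were missing, a single failure would leave collection
switched off for the rest of the session.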

A final alternative might be to provide a vectorized version of mkChar
that would accept a char ** and use one of these methods internally,
rather than exporting the underlying methods as part of R's API.  I
don't know if there are other clear use cases where GC is a serious
bottleneck, besides constructing large vectors of mostly unique
strings.  Such a function would be less generally useful since it 
would require that the full vector of C strings be assembled at one
time.
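
As a rough analogue of why a vectorized entry point helps, the sketch
below takes a char ** and sizes its storage in one pass before copying
anything.  It uses plain malloc rather than R's allocator, and the
names toy_strvec, toy_mk_strvec and toy_free_strvec are hypothetical;
it does not show how mkChar or a STRSXP work internally.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: one storage block plus per-element pointers. */
typedef struct {
    char *storage;   /* all strings back to back, each NUL-terminated */
    char **elt;      /* elt[i] points at string i inside storage */
    size_t n;
} toy_strvec;

/* Build the vector from n C strings.  Because every string is known
   up front, the total size is computed once and reserved in a single
   step.  Returns 0 on success, -1 on allocation failure. */
int toy_mk_strvec(const char **strs, size_t n, toy_strvec *out)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(strs[i]) + 1;

    out->storage = malloc(total ? total : 1);
    out->elt = malloc((n ? n : 1) * sizeof *out->elt);
    if (out->storage == NULL || out->elt == NULL) {
        free(out->storage);
        free(out->elt);
        out->storage = NULL;
        out->elt = NULL;
        out->n = 0;
        return -1;
    }

    char *p = out->storage;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(strs[i]) + 1;   /* include the NUL */
        memcpy(p, strs[i], len);
        out->elt[i] = p;
        p += len;
    }
    out->n = n;
    return 0;
}

void toy_free_strvec(toy_strvec *v)
{
    free(v->storage);
    free(v->elt);
    v->storage = NULL;
    v->elt = NULL;
    v->n = 0;
}
```

The sizing pass is only possible because the caller hands over the
whole array at once, which is the same restriction noted above.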

-- Dave


