[Rd] Moderating consequences of garbage collection when in C
Thomas Lumley
tlumley at uw.edu
Mon Nov 14 21:00:10 CET 2011
On Tue, Nov 15, 2011 at 8:47 AM, <dhinds at sonic.net> wrote:
> dhinds at sonic.net wrote:
>> Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> > Allocating many small objects triggers numerous garbage collections as R
>> > grows its memory, seriously degrading performance. The specific use case
>> > is in creating a STRSXP of several million elements of 60-100
>> > characters each; a simplified illustration understating the effects
>> > (because there is initially little to garbage collect, in contrast to an
>> > R session with several packages loaded) is below.
>
>> What a coincidence -- I was just going to post a question about why it
>> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
>> characters long. I had noticed that this seemed to show much worse
>> than linear scaling. I had not thought of garbage collection as the
>> culprit -- but indeed it is. By manipulating the GC trigger, I can
>> make this operation take as little as 3 seconds (with no GC) or as
>> long as 76 seconds (with 31 garbage collections).
>
> I had done some google searches on this issue, since it seemed like it
> should not be too uncommon, but the only other hit I could come up
> with was a thread from 2006:
>
> https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
>
> In any case, one issue with your suggested workaround is that it
> requires knowing how much additional storage is needed, which may be
> an expensive operation to determine. I've just tried implementing a
> different approach, which is to define two new functions to either
> disable or enable GC. The function to disable GC first invokes
> R_gc_full() to shrink the heap as much as possible, then sets a flag.
> Then in R_gc_internal(), I first check that flag, and if it is set, I
> call AdjustHeapSize(size_needed) and exit immediately.
>
> These calls could be used to bracket any code section that expects to
> make lots of calls to R's memory allocator. The downside is that
> every path out of such a code section (including error handling)
> must take care to unset the GC-disabled flag. I think I would want
> to hear from someone on the R team about
> whether they think this is a good idea.
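
For readers following the thread, here is a minimal toy model of the
bracketing idea quoted above. It is only a sketch: every name in it
(toy_gc_disable, toy_gc_internal, toy_adjust_heap, and so on) is made up
for illustration and merely mimics the roles that R_gc_full(),
R_gc_internal() and AdjustHeapSize() play in the description. It also
assumes that every object stays live, so a collection frees nothing and
the heap can only grow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy collector state. None of this is R's real internal state. */
static bool   gc_disabled = false;  /* the proposed "GC disabled" flag   */
static size_t heap_limit  = 1024;   /* allocation level that triggers GC */
static size_t heap_used   = 0;      /* bytes handed out so far           */
static int    gc_count    = 0;      /* collections actually performed    */

/* Stand-in for a full collection. All objects are assumed live, so
   nothing is reclaimed; we only count that the work was done. */
static void toy_gc_full(void) { gc_count++; }

/* Stand-in for growing the heap until the request fits. */
static void toy_adjust_heap(size_t size_needed) {
    while (heap_used + size_needed > heap_limit) heap_limit *= 2;
}

/* Entered when an allocation does not fit under the current trigger. */
static void toy_gc_internal(size_t size_needed) {
    if (gc_disabled) {              /* flag set: grow and exit immediately */
        toy_adjust_heap(size_needed);
        return;
    }
    toy_gc_full();
    toy_adjust_heap(size_needed);
}

static void toy_alloc(size_t n) {
    if (heap_used + n > heap_limit) toy_gc_internal(n);
    heap_used += n;
}

/* The two proposed entry points: collect once, then set the flag. */
void toy_gc_disable(void) { toy_gc_full(); gc_disabled = true; }
void toy_gc_enable(void)  { gc_disabled = false; }

/* Allocate `count` objects of `size` bytes each and return how many
   collections ran during the allocation loop itself. */
int toy_fill(size_t count, size_t size, bool bracket) {
    heap_used = 0; heap_limit = 1024; gc_count = 0; gc_disabled = false;
    if (bracket) toy_gc_disable();
    int before = gc_count;
    for (size_t i = 0; i < count; i++) toy_alloc(size);
    if (bracket) toy_gc_enable();
    return gc_count - before;
}
```

Calling toy_fill(10000, 64, false) and toy_fill(10000, 64, true) compares
the two cases. The model says nothing about real timings; it only shows
where the flag check sits and why the bracketed loop skips the repeated
collections.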
If .Call and .C re-enabled the GC on return from compiled code (and
threw some sort of error if it had been left disabled), that would
help contain the potential damage.
You might also want to re-enable GC if malloc() returned NULL,
rather than giving an out-of-memory error.
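
A rough sketch of both safeguards, under loud assumptions: guarded_call()
is a hypothetical stand-in for the point where .Call/.C regain control,
alloc_with_fallback() is a hypothetical allocation wrapper, and the flag
and counter are toy variables, not anything in R's sources. In real R an
error leaves compiled code via a longjmp, so the reset would also have to
happen on that path; the sketch covers only a normal return.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static bool gc_disabled = false;  /* hypothetical "GC disabled" flag      */
static int  gc_runs     = 0;      /* counts stand-in collections performed */

typedef void  (*native_fn)(void);
typedef void *(*raw_alloc_fn)(size_t);

/* Run a native routine. If it returns with GC still disabled, switch GC
   back on and report the problem: 0 means clean, -1 means the routine
   forgot to re-enable GC. */
int guarded_call(native_fn fn) {
    fn();
    if (gc_disabled) {
        gc_disabled = false;
        return -1;
    }
    return 0;
}

/* Allocate n bytes. If the raw allocator fails while GC is disabled,
   re-enable GC, "collect", and retry once before giving up. */
void *alloc_with_fallback(size_t n, raw_alloc_fn raw_alloc) {
    void *p = raw_alloc(n);
    if (p == NULL && gc_disabled) {
        gc_disabled = false;
        gc_runs++;                /* stand-in for a real collection */
        p = raw_alloc(n);
    }
    return p;
}

/* Demo routines: one restores the flag, one leaves it set. */
void tidy_routine(void)  { gc_disabled = true; gc_disabled = false; }
void leaky_routine(void) { gc_disabled = true; }

/* Demo allocator that fails on its first call only. */
void *flaky_alloc(size_t n) {
    static int calls = 0;
    return (calls++ == 0) ? NULL : malloc(n);
}
```

Here guarded_call(leaky_routine) reports the leftover flag and clears it,
and alloc_with_fallback(16, flaky_alloc) with the flag set succeeds on the
retry instead of failing outright.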
-thomas
--
Thomas Lumley
Professor of Biostatistics
University of Auckland