[Rd] Moderating consequences of garbage collection when in C

Martin Morgan mtmorgan at fhcrc.org
Mon Nov 14 21:08:24 CET 2011

On 11/14/2011 11:47 AM, dhinds at sonic.net wrote:
> dhinds at sonic.net wrote:
>> Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>>> Allocating many small objects triggers numerous garbage collections as R
>>> grows its memory, seriously degrading performance. The specific use case
>>> is creating a STRSXP of several million elements of 60-100 characters
>>> each; a simplified illustration, which understates the effect (because
>>> there is initially little to garbage collect, in contrast to an R
>>> session with several packages loaded), is below.
>> What a coincidence -- I was just going to post a question about why it
>> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
>> characters long.  I had noticed that this seemed to show much worse
>> than linear scaling.  I had not thought of garbage collection as the
>> culprit -- but indeed it is.  By manipulating the GC trigger, I can
>> make this operation take as little as 3 seconds (with no GC) or as
>> long as 76 seconds (with 31 garbage collections).
> I had done some google searches on this issue, since it seemed like it
> should not be too uncommon, but the only other hit I could come up
> with was a thread from 2006:
> https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
> In any case, one issue with your suggested workaround is that it
> requires knowing how much additional storage is needed, which may be
> an expensive operation to determine.  I've just tried implementing a
> different approach, which is to define two new functions to either
> disable or enable GC.  The function to disable GC first invokes
> R_gc_full() to shrink the heap as much as possible, then sets a flag.
> Then in R_gc_internal(), I first check that flag, and if it is set, I
> call AdjustHeapSize(size_needed) and exit immediately.

I think this is a better approach; mine seriously understated the 
complexity of figuring out the required size.

> These calls could be used to bracket any code section that expects to
> make lots of calls to R's memory allocator.  The downside is that
> this approach requires that all paths out of such a code section
> (including error handling) take care to unset the GC-disabled
> flag.  I think I would want to hear from someone on the R team about
> whether they think this is a good idea.
> A final alternative might be to provide a vectorized version of mkChar
> that would accept a char ** and use one of these methods internally,
> rather than exporting the underlying methods as part of R's API.  I
> don't know if there are other clear use cases where GC is a serious
> bottleneck, besides constructing large vectors of mostly unique
> strings.  Such a function would be less generally useful since it
> would require that the full vector of C strings be assembled at one
> time.

Another place where this comes up is during package load, especially for 
packages with many S4 instances.

   > gcinfo(TRUE)
   > library(Matrix)
   Garbage collection 2 = 1+0+1 (level 0) ...
   7.6 Mbytes of cons cells used (40%)
   1.1 Mbytes of vectors used (18%)
   Garbage collection 58 = 39+9+10 (level 2) ...
   39.4 Mbytes of cons cells used (75%)
   2.9 Mbytes of vectors used (47%)

and continuing

   > library(IRanges)
   Garbage collection 89 = 60+14+15 (level 1) ...
   63.1 Mbytes of cons cells used (80%)
   4.3 Mbytes of vectors used (53%)

Also, something like

   > system.time(as.character(1:10000000))
   Garbage collection 124 = 60+14+50 (level 2) ...
   596.1 Mbytes of cons cells used (95%)
   226.3 Mbytes of vectors used (69%)
      user  system elapsed
   61.908   0.297  62.303

might be an R-level manifestation of the same problem.

Being able to disable / enable the GC seems like a useful patch, and I 
hope it is of interest to the R-core team.

A more fundamental issue seems to be the cost of garbage collection 
itself when there are many SEXPs in play:

   > system.time(gc())
      user  system elapsed
     0.236   0.000   0.236

There's a hierarchy of CHARSXP / STRSXP, so maybe that could be 
exploited in the mark phase?


> -- Dave

Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
