[Rd] allocVector bug ?

Luke Tierney luke at stat.uiowa.edu
Wed Nov 8 18:56:12 CET 2006


On Mon, 6 Nov 2006, Vladimir Dergachev wrote:

>
> Hi Luke,
>
>   Thank you for the patient reply!
>
>   I have looked into the issue a little deeper, comments below:
>
> On Thursday 02 November 2006 11:26 pm, Luke Tierney wrote:
>> On Wed, 1 Nov 2006, Vladimir Dergachev wrote:
>>> Hi all,
>>>
>>>  I was looking at the following piece of code in src/main/memory.c,
>>> function allocVector :
>>>
>>>    if (size <= NodeClassSize[1]) {
>>> 	node_class = 1;
>>> 	alloc_size = NodeClassSize[1];
>>>    }
>>>    else {
>>> 	node_class = LARGE_NODE_CLASS;
>>> 	alloc_size = size;
>>> 	for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
>>> 	    if (size <= NodeClassSize[i]) {
>>> 		node_class = i;
>>> 		alloc_size = NodeClassSize[i];
>>> 		break;
>>> 	    }
>>> 	}
>>>    }
>>>
>>>
>>> It appears that for LARGE_NODE_CLASS the variable alloc_size should not
>>> be size, but something far smaller, since we are not using the vector
>>> heap but rather calling malloc directly in the code below (and from
>>> discussions I have read on this mailing list I gather the two are
>>> distinct - please let me know if I am wrong).
>>>
>>> So when allocating a large vector the garbage collector goes nuts trying
>>> to find all that space, which is not going to be needed after all.
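>>>
>>> For reference, the large-node branch further down boils down to a
>>> direct malloc - I am paraphrasing from memory rather than quoting
>>> memory.c verbatim, so the exact expression may differ:
>>>
>>>    /* LARGE_NODE_CLASS: the data is not carved out of vector heap
>>>       pages but requested straight from the system allocator */
>>>    s = malloc(sizeof(SEXPREC_ALIGN) + size * sizeof(VECREC));
>>>    if (s == NULL) {
>>>        /* run a gc and retry once before giving up */
>>>    }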
>>
>> This is as intended, not a bug. The garbage collector does not "go
>> nuts" -- it is doing a garbage collection that may release memory in
>> advance of making a large allocation.  The size of the current
>> allocation request is used as part of the process of deciding when to
>> satisfy an allocation by malloc (of a single large node or a page) and
>> when to first do a gc.  It is essential to do this for large
>> allocations as well, to keep the memory footprint down and to help
>> reduce fragmentation.
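>>
>> Concretely, the check at the top of allocVector is, schematically,
>>
>>    if (FORCE_GC || NO_FREE_NODES() || VHEAP_FREE() < alloc_size)
>>        R_gc_internal(alloc_size);
>>
>> so a large request forces a collection only when the free space under
>> the current soft limit R_VSize is smaller than the request.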
>
> I generally agree with this, however I believe that the current logic
> breaks down for large allocation sizes: my code ends up spending 70% (and
> up) of its time spinning inside the garbage collector (I ran oprofile to
> observe what is going on).

Again, please be careful about these sorts of statements.  I am sure
there are bugs in the memory manager and places where things "break
down", but this isn't one of them.  The memory manager is quite
deliberately biased towards keeping the total allocation low, if
necessary at the expense of some extra gc overhead.  This is needed if
we want to use the same settings across a wide range of
configurations, some of which have relatively little memory available
(think student labs).  The memory manager does try to learn about the
needs of a session, and as a result the triggering values get
adjusted.  It is not true that every large allocation causes a gc.
This may be true _initially_, but once total memory usage stabilizes
at a particular level it is no longer true (look at the way the heap
limits are adjusted).
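
Schematically, the adjustment works along these lines (a sketch of the
idea only, not the actual memory.c code; adjust_heap_sketch and the
GROW_*/SHRINK_* constants are made-up names):

    static void adjust_heap_sketch(R_size_t size_needed)
    {
        /* occupancy of the vector heap relative to the soft limit,
           counting the allocation we are about to make */
        double occup =
            (double) (R_VSize - VHEAP_FREE() + size_needed) / R_VSize;
        if (occup > GROW_FRAC)
            /* mostly full even after a gc: raise the limit so the next
               large request no longer trips VHEAP_FREE() < alloc_size */
            R_VSize += GROW_INCR + GROW_FRAC * R_VSize;
        else if (occup < SHRINK_FRAC)
            /* mostly empty: lower the limit again, but not below the
               startup value, to keep the footprint small */
            R_VSize -= SHRINK_INCR;
    }

Once usage stabilizes, occupancy sits between the two thresholds, the
limit stops moving, and large allocations no longer trigger a gc on
every request.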

This approach of adjusting based on usage within a session is
reasonable and works well for longer sessions.  It may not work well
for short scripts that need large allocations.  I doubt that any
automated setting can work well in that situation while at the same
time keeping memory usage in other settings low.  So it may be useful
to find ways of specifying a collection strategy appropriate for such
situations.  If you can send me a simplified version of your usage
scenario, I will give this some thought and see if we can come up with
some reasonable ways of allowing user code to tweak gc behavior in
these cases.
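
As a stopgap, the existing startup limits can already be used to
pre-size the heap and bypass the learning phase, e.g. something like

    R --min-vsize=512M

(see ?Memory for the accepted forms), which starts the soft limit high
enough that large allocations do not trigger collections until that
much memory is actually in use.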

Best,

luke

>
> I do realize that garbage collection is not an easy problem and that
> hardware and software environments change - my desire is simply to have a
> version of R that is usable for the problems I am dealing with since,
> aside from the slowdown with large vector sizes, I find R a very capable
> tool.
>
> I would greatly appreciate it if you could comment on the following
> observations:
>
>  1. The time spent during a single garbage collector run grows with the
> number of nodes - from looking at the code I believe it is linear, but I
> am not certain.
>
>  2. In my case the data.frame contains a few string vectors. These
> allocate lots of CHARSXPs, which are the main cause of the slowdown of
> each garbage collector run (see the sketch after this list). Would you
> have any suggestions on optimizing this particular situation?
>
>  3. Any time a data.frame is created, or one performs an attach()
> operation, there is a series of allocations - and if one of them causes
> memory to expand, all the rest will too.
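>
>  To illustrate points 1 and 2, here is a hypothetical sketch of a mark
> pass (not the actual memory.c code; mark_sketch, NODE_IS_MARKED and
> MARK_NODE are made-up names, the rest is the standard R API):
>
>    static void mark_sketch(SEXP s)
>    {
>        if (s == NULL || NODE_IS_MARKED(s))
>            return;
>        MARK_NODE(s);
>        switch (TYPEOF(s)) {
>        case STRSXP:
>            /* every element of a character vector is a separate
>               CHARSXP node, so a long string column contributes
>               LENGTH(s) nodes that must be visited on every single
>               collection */
>            for (R_len_t i = 0; i < LENGTH(s); i++)
>                mark_sketch(STRING_ELT(s, i));
>            break;
>        /* ... other pointer-carrying types ... */
>        }
>    }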
>
>  I put in an fprintf() statement to show alloc_size, VHEAP_FREE and
> R_VSize when allocVector is called (this is done only for node_class ==
> LARGE_NODE_CLASS).
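>
>  The statement is, up to formatting, the one below, placed at the top
> of the LARGE_NODE_CLASS branch:
>
>    fprintf(stderr, "alloc_size=%lu VHEAP_FREE=%lu R_VSize=%lu\n",
>            (unsigned long) alloc_size, (unsigned long) VHEAP_FREE(),
>            (unsigned long) R_VSize);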
>
>  The first output snippet is from the time the script starts and tries to
> create the data.frame:
>
> alloc_size=128 VHEAP_FREE=604182 R_VSize=786432
> alloc_size=88 VHEAP_FREE=660051 R_VSize=786432
> alloc_size=88 VHEAP_FREE=659963 R_VSize=786432
> alloc_size=4078820 VHEAP_FREE=659874 R_VSize=786432
> alloc_size=4078820 VHEAP_FREE=260678 R_VSize=4465461
> alloc_size=4078820 VHEAP_FREE=260678 R_VSize=8544282
> alloc_size=4078820 VHEAP_FREE=260678 R_VSize=12623103
> ...
> alloc_size=4078820 VHEAP_FREE=260677 R_VSize=271628325
> alloc_size=4078820 VHEAP_FREE=260677 R_VSize=275707147
>
> As you can see, the VHEAP_FREE() < alloc_size logic gets triggered every
> single time, and we run the garbage collector even though the memory
> footprint is clearly growing - each collection frees essentially nothing,
> and R_VSize is simply grown by about one alloc_size per allocation. The
> fact that in my case the first few columns are strings with lots of
> different values makes this particularly bad.
>
> Next, as a test, I ran attach(B), which produced the following output:
>> attach(B)
> alloc_size=4078820 VHEAP_FREE=1274112 R_VSize=294022636
> alloc_size=4078820 VHEAP_FREE=499351 R_VSize=297325768
> ...
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=568670030
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=572748850
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=576827670
> alloc_size=88 VHEAP_FREE=602082 R_VSize=580906490
> alloc_size=88 VHEAP_FREE=601915 R_VSize=580906490
> alloc_size=88 VHEAP_FREE=601798 R_VSize=580906490
> alloc_size=88 VHEAP_FREE=601678 R_VSize=580906490
> ...
> alloc_size=44 VHEAP_FREE=591581 R_VSize=580906490
> alloc_size=88 VHEAP_FREE=591323 R_VSize=580906490
> alloc_size=44 VHEAP_FREE=591220 R_VSize=580906490
>
> So we have the same behaviour as before - the garbage collector gets run
> every time attach() creates a new large vector, while it functions
> perfectly for smaller vector sizes.
>
> Next, I did detach(B) (which freed up memory) followed by "F<-B[,1]":
>
> alloc_size=113 VHEAP_FREE=588448 R_VSize=580906490
> alloc_size=618 VHEAP_FREE=588335 R_VSize=580906490
> alloc_size=618 VHEAP_FREE=587717 R_VSize=580906490
> alloc_size=128 VHEAP_FREE=587099 R_VSize=580906490
> alloc_size=88 VHEAP_FREE=586825 R_VSize=580906490
> alloc_size=4078820 VHEAP_FREE=586737 R_VSize=580906490
> alloc_size=4078820 VHEAP_FREE=284079854 R_VSize=580906490
> alloc_size=4078820 VHEAP_FREE=280001034 R_VSize=580906490
> alloc_size=4078820 VHEAP_FREE=275922214 R_VSize=580906490
> alloc_size=4078820 VHEAP_FREE=271843394 R_VSize=580906490
> ...
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=843990775
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=848069595
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=852148415
> alloc_size=4078820 VHEAP_FREE=602082 R_VSize=856227235
> alloc_size=88 VHEAP_FREE=602082 R_VSize=860306055
>
> In this case the logic worked - it freed the memory used up by attach() -
> but it stopped working after that memory was consumed. (Why the subscript
> operator tries to allocate an entire copy of the data.frame is a completely
> separate question - I am looking into it, but with slow progress.)
>
> My latest attempt to address this is the following modification to
> allocVector:
>
>
> R_size_t vheap_grown = 0;
> R_size_t vheap_reused = 0;
> int count_skip = 0;
> int max_skip = 0;
>
> SEXP allocVector(SEXPTYPE type, R_len_t length)
> {
> ...
> ...
>    /* we need to do the gc here so allocSExp doesn't! */
>    if (FORCE_GC || NO_FREE_NODES() ||
>        ((VHEAP_FREE() < alloc_size) && !count_skip)) {
>        fprintf(stderr, "Running gc\n");
>        R_gc_internal(alloc_size);
>        if (NO_FREE_NODES())
>            mem_err_cons();
>        if (VHEAP_FREE() < alloc_size)
>            mem_err_heap(size);
>
>        /* compare against 2*alloc_size: if the gc had to grow R_VSize,
>           it grew it just enough for this alloc_size request, leaving
>           less than 2*alloc_size free */
>        if (VHEAP_FREE() < 2*alloc_size) {
>            /* footprint is growing: back off exponentially, skipping
>               1, 3, 7, ... collections */
>            vheap_grown += alloc_size;
>            if (vheap_grown > vheap_reused) {
>                max_skip = 2*max_skip + 1;
>                count_skip = max_skip;
>            }
>        } else {
>            /* the gc reclaimed space: memory is being reused, so reset
>               the back-off once reuse dominates growth */
>            vheap_reused += alloc_size;
>            if (vheap_reused > 2*vheap_grown) {
>                vheap_reused = 0;
>                vheap_grown = 0;
>                max_skip = 0;
>            }
>        }
>    }
>
>    /* Adjust R_VSize if we skipped garbage collection */
>    if ((VHEAP_FREE() < alloc_size) && count_skip) {
>        R_VSize += alloc_size;
>        count_skip--;
>    }
> ...
> ...
> }
>
> This code attempts to detect situations where the memory footprint is
> growing and, in that case, to skip garbage collections before checking
> again: each time growth outpaces reuse, max_skip goes from 0 to 1, 3, 7
> and so on, so collections become exponentially rarer while the heap is
> filling up.
>
>                              thank you very much!
>
>                                            Vladimir Dergachev
>
>
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


