[Rd] allocVector bug ?

Vladimir Dergachev vdergachev at rcgardis.com
Thu Nov 9 19:46:56 CET 2006


On Thursday 09 November 2006 12:21 pm, Luke Tierney wrote:
> On Wed, 8 Nov 2006, Vladimir Dergachev wrote:
> > On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> >> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> >
> > Hi Luke,
> >
> >   Yes, I gladly concede the point that for a heuristic algorithm the
> > notion of what is a "bug" is murky (besides crashes, etc, which is not
> > what I am talking about).
> >
> >   Here is why I called this a bug:
> >
> >     1. My understanding is that each time gc() needs to increase memory
> > it performs a full garbage collection run. Right ?
>
> The allocation process does not call gc before every call to malloc.
> It only calls gc if the allocation would cross a threshold level.
> Those threshold levels are adjusted in an effort to compromise between
> keeping memory footprint low and not calling gc too often.  The code
> you quote below is part of this adjustment process.  If this process
> is working properly then as memory use grows there will initially be
> more gc activity and then less as the thresholds adjust.

Well, I was seeing it call gc for every large vector. This probably happens 
only for those larger than R_VGrowIncrFrac * R_NSize. On my system R_NSize 
is never more than 1e6, so this would explain the problems when using 1e6 (and 
larger) vectors.

>
> >     2. This is not a problem with small memory sizes as they imply
> > (presumably) small number of objects.
> >
> >     3. However, if one wants to allocate many objects (say columns in a
> > data frame or just vectors) this results in a large penalty.
> >
> > Example 1: This simulates allocation of a data.frame with some character
> > columns which are assumed to be factors. On my system the first assignment
> > is nearly instantaneous, while subsequent assignments take slightly less
> > than 0.1 seconds each.
>
> I'm not sure these are quite doing what you intend.  You define Chars
> but don't use it.  Also, system.time by default calls gc() before
> doing the evaluation. Giving FALSE as the second argument may give you
> a more realistic picture.

The Chars are defined to create lots of ncells and make gc() run time more 
realistic. It also mimics having a data.frame with a few factor columns.

As for system.time - thank you, I missed that! 
Setting gcFirst=FALSE makes the first example 2 times faster and makes all 
the allocations in the second example faster.

I guess that extra call to gc() caused R_VSize to shrink too fast.

> > I looked more carefully at your code in src/main/memory.c, function
> > AdjustHeapSize:
> >
> > R_VSize = VNeeded;
> > if (vect_occup > R_VGrowFrac) {
> >     R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
> >     if (R_MaxVSize - R_VSize >= change)
> >         R_VSize += change;
> > }
> >
> > Could it be that R_NSize should be R_VSize ? This would explain why I see
> > a problem in case R_VSize>>R_NSize.
>
> That does indeed look like a bug and that R_NSize should be R_VSize --
> well spotted, thanks.  I will need to experiment with this a bit more
> to see if it can safely be changed.  It will increase the memory
> footprint a bit.  Probably not by enough to matter but if it does we
> may need to adjust some of the tuning constants.
>

Is there anything I can help you with? Is there a script that runs 
through common usage patterns?

                          thank you !

                                  Vladimir Dergachev


> Best,
>
> luke
>


