[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory

Thu Aug 9 23:18:32 CEST 2007

On Thu, 9 Aug 2007, Charles C. Berry wrote:

> On Thu, 9 Aug 2007, Michael Cassin wrote:
>
>> I really appreciate the advice and this database solution will be useful to
>> me for other problems, but in this case I need to address the specific
>> problem of scan and read.* using so much memory.
>>
>> Is this expected behaviour?

Yes, and documented in the 'R Internals' manual.  That is basic reading 
for people wishing to comment on efficiency issues in R.

>> Can the memory usage be explained, and can it be
>> made more efficient?  For what it's worth, I'd be glad to try to help if the
>> code for scan is considered to be worth reviewing.
>
> Mike,
>
> This does not seem to be an issue with scan() per se.
>
> Notice the difference in size of big2, big3, and bigThree here:
>
>> big2 <- rep(letters,length=1e6)
>> object.size(big2)/1e6
> [1] 4.000856
>> big3 <- paste(big2,big2,sep='')
>> object.size(big3)/1e6
> [1] 36.00002

On a 32-bit computer every R object has an overhead of 24 or 28 bytes. 
Character strings are R objects, but in some functions such as rep (and 
scan for up to 10,000 distinct strings) the objects can be shared.  More 
string objects will be shared in 2.6.0 (but factors are designed to be 
efficient at storing character vectors with few values).

On a 64-bit computer the overhead is usually double.  So I would expect 
just over 56 bytes/string for distinct short strings (and that is what 
big3 gives).
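
For illustration only (this sketch is not part of the original message): the 
comparison between shared strings, distinct strings and a factor can be 
reproduced roughly as follows.  The names n, shared, distinct, as_factor 
and sizes are made up for the example, and the exact figures depend on the 
R version and on whether the build is 32- or 64-bit, so no expected output 
is shown.  In recent versions of R identical strings are generally shared, so 
sprintf() is used here to force the strings to be genuinely distinct.

```r
# Sketch: compare memory for a character vector with few distinct values,
# one with all-distinct values, and a factor. Exact sizes depend on the
# R version and on whether the build is 32- or 64-bit.
n <- 1e5

shared    <- rep(letters, length.out = n)   # only 26 distinct strings
distinct  <- sprintf("s%06d", seq_len(n))   # n distinct short strings
as_factor <- factor(shared)                 # integer codes plus 26 levels

sizes <- c(
  shared   = as.numeric(object.size(shared)),
  distinct = as.numeric(object.size(distinct)),
  factor   = as.numeric(object.size(as_factor))
)

# Approximate bytes per element for each representation
print(round(sizes / n, 1))
```

The per-element figure for the distinct strings should be the largest of the 
three, since each element then needs its own string object in addition to the 
pointer stored in the vector.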

But 56 MB is really not very much (tiny on a 64-bit computer), and 1 
million items is a lot.

[...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595