[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Aug 10 10:26:52 CEST 2007
I don't understand why one would run a 64-bit version of R on a 2GB
server, especially if one were worried about object size. You can run
32-bit versions of R on x86_64 Linux (see the R-admin manual for a
comprehensive discussion), and most other 64-bit OSes default to 32-bit
executables. Since most OSes limit 32-bit executables to around 3GB of
address space, a case for 64-bit executables starts to emerge at 4GB of
RAM, but there is not much of one at 2GB.
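As an aside, it is easy to check from within a running session which kind of build is in use (this is standard base R, not something from the original thread):

```r
# Pointer size distinguishes 32-bit from 64-bit builds of R:
# 4 bytes on a 32-bit build, 8 bytes on a 64-bit build.
print(.Machine$sizeof.pointer)
print(R.version$arch)   # e.g. "x86_64" or "i686"
```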
It was my intention when providing the infrastructure for it that Linux
binary distributions on x86_64 would provide both 32-bit and 64-bit
executables, but that has not happened. It would be possible to install
ix86 builds on x86_64 if -m32 were part of the ix86 compiler specification
and the dependency checks noticed that they needed 32-bit libraries.
(I've had trouble with the latter on FC5: an X11 update removed all my
32-bit X11 RPMs.)
On Fri, 10 Aug 2007, Michael Cassin wrote:
> Thanks for all the comments,
> The artificial dataset is as representative of my 440MB file as I could design.
> I did my best to reduce the complexity of my problem to minimal
> reproducible code as suggested in the posting guidelines. Having
> searched the archives, I was happy to find that the topic had been
> covered, where Prof Ripley suggested that the I/O manuals gave some
> advice. However, I was unable to get anywhere with the I/O manuals
> I spent 6 hours preparing my post to R-help. Sorry not to have read
> the 'R-Internals' manual. I just wanted to know if I could use scan()
> more efficiently.
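[For readers landing on this thread: the usual first step in trimming
scan()/read.* overhead, separate from the string-storage issue discussed
below, is to declare column types up front rather than letting R guess.
A minimal sketch, using a temporary file with a made-up three-column
layout:]

```r
# Hypothetical 3-column CSV written to a temp file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value,code", "1,2.5,ab", "2,3.5,cd"), tmp)

# colClasses stops read.csv from guessing types, and reading a string
# column as a factor avoids one character object per row.
dat <- read.csv(tmp,
                colClasses = c("integer", "numeric", "factor"))
str(dat)
unlink(tmp)
```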
> My hurdle seems nothing to do with efficiently calling scan(). I
> suspect the same is true for the originator of this memory experiment
> thread. It is the overhead of storing short strings, as Charles
> identified and Brian explained. I appreciate the investigation and
> clarification you both have made.
> 56B overhead for a 2-character string seems extreme to me, but I'm not
> complaining. I really like R, and being free, accept that
Well, there are only about 50000 2-char strings in an 8-bit locale, so
this does seem a case for using factors (as has been pointed out several
times).
And BTW, it is not 56B overhead, but 56B total for up to 7 chars.
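[The factor suggestion can be illustrated along these lines; exact sizes
vary by platform and R version, so no specific numbers are claimed:]

```r
# One million 2-char strings drawn from a small set of values.
x <- sample(c("ab", "cd", "ef", "gh"), 1e6, replace = TRUE)

# As a factor: one integer code per element plus a tiny level table,
# instead of one pointer (and possibly one string object) per element.
f <- factor(x)

print(object.size(x), units = "Mb")
print(object.size(f), units = "Mb")
```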
> In my case pre-processing is not an option, it is not a one off
> problem with a particular file. In my application, R is run in batch
> mode as part of a tool chain for arbitrary csv files. Having found
> cases where memory usage was as high as 20x file size, and allowing
> for a copy of the loaded dataset, I'll just need to document that
> it is possible that files as small as 1/40th of system memory may
> consume it all. That rules out some important datasets (US Census, UK
> Office of National Statistics files, etc) for 2GB servers.
> Regards, Mike
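[The back-of-envelope check Mike describes — up to roughly 20x the file
size in memory, doubled to allow for one working copy — can be sketched
as below. The helper name and the temp file standing in for a real
dataset are illustrative, not from the thread:]

```r
# Rough pre-flight estimate before loading an arbitrary CSV in batch
# mode. Assumptions (from the discussion above): up to ~20x expansion
# of the file size in memory, plus room for one copy of the data.
estimate_load_bytes <- function(path, expansion = 20, copies = 2) {
  file.info(path)$size * expansion * copies
}

# Example with a temporary file in place of a real dataset:
tmp <- tempfile()
writeLines(rep("a,b,c", 1000), tmp)
est <- estimate_load_bytes(tmp)
print(est)
unlink(tmp)
```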
> On 8/9/07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>> On Thu, 9 Aug 2007, Charles C. Berry wrote:
>>> On Thu, 9 Aug 2007, Michael Cassin wrote:
>>>> I really appreciate the advice and this database solution will be useful to
>>>> me for other problems, but in this case I need to address the specific
>>>> problem of scan and read.* using so much memory.
>>>> Is this expected behaviour?
>> Yes, and documented in the 'R Internals' manual. That is basic reading
>> for people wishing to comment on efficiency issues in R.
>>>> Can the memory usage be explained, and can it be
>>>> made more efficient? For what it's worth, I'd be glad to try to help if the
>>>> code for scan is considered to be worth reviewing.
>>> This does not seem to be an issue with scan() per se.
>>> Notice the difference in size of big2, big3, and bigThree here:
>>>> big2 <- rep(letters,length=1e6)
>>>  4.000856
>>>> big3 <- paste(big2,big2,sep='')
>>>  36.00002
>> On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
>> Character strings are R objects, but in some functions such as rep (and
>> scan for up to 10,000 distinct strings) the objects can be shared. More
>> string objects will be shared in 2.6.0 (but factors are designed to be
>> efficient at storing character vectors with few values).
>> On a 64-bit computer the overhead is usually double. So I would expect
>> just over 56 bytes/string for distinct short strings (and that is what
>> big3 gives).
>> But 56Mb is really not very much (tiny on a 64-bit computer), and 1
>> million items is a lot.
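[The quoted big2/big3 comparison is easy to re-run. Note the hedge: the
figures quoted above reflect pre-2.6.0 behaviour, where paste() allocated
a fresh string object per element; in R >= 2.6.0 the global string cache
shares identical strings, so big3 — which contains only the 26 distinct
values "aa".."zz" — may come out the same size as big2:]

```r
big2 <- rep(letters, length.out = 1e6)   # 26 shared string objects
big3 <- paste(big2, big2, sep = "")      # 26 distinct values "aa".."zz"

print(object.size(big2), units = "Mb")
print(object.size(big3), units = "Mb")
```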
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595