[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Aug 10 10:26:52 CEST 2007
I don't understand why one would run a 64-bit version of R on a 2GB
server, especially if one were worried about object size. You can run
32-bit versions of R on x86_64 Linux (see the R-admin manual for a
comprehensive discussion), and most other 64-bit OSes default to 32-bit
executables. Since most OSes limit 32-bit executables to around 3GB of
address space, a case for 64-bit executables starts to emerge at 4GB of
RAM, but there is not much of one at 2GB.
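As an aside, it is easy to check from within a running session which kind of build is in use (this is standard base R, not something from the original thread):

```r
# Pointer size distinguishes 32-bit from 64-bit builds of R:
# 4 bytes on a 32-bit build, 8 bytes on a 64-bit build.
print(.Machine$sizeof.pointer)
print(R.version$arch)   # e.g. "x86_64" or "i686"
```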
It was my intention when providing the infrastructure for it that Linux
binary distributions on x86_64 would provide both 32-bit and 64-bit
executables, but that has not happened. It would be possible to install
ix86 builds on x86_64 if -m32 were part of the ix86 compiler specification
and the dependency checks noticed that they needed 32-bit libraries.
(I've had trouble with the latter on FC5: an X11 update removed all my
32-bit X11 RPMs.)
On Fri, 10 Aug 2007, Michael Cassin wrote:
> Thanks for all the comments,
> The artificial dataset is as representative of my 440MB file as I could design.
> I did my best to reduce the complexity of my problem to minimal
> reproducible code as suggested in the posting guidelines. Having
> searched the archives, I was happy to find that the topic had been
> covered, where Prof Ripley suggested that the I/O manuals gave some
> advice. However, I was unable to get anywhere with the I/O manuals
> I spent 6 hours preparing my post to R-help. Sorry not to have read
> the 'R-Internals' manual. I just wanted to know if I could use scan()
> more efficiently.
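[For readers landing on this thread: the usual first step in trimming
scan()/read.* overhead, separate from the string-storage issue discussed
below, is to declare column types up front rather than letting R guess.
A minimal sketch, using a temporary file with a made-up three-column
layout:]

```r
# Hypothetical 3-column CSV written to a temp file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value,code", "1,2.5,ab", "2,3.5,cd"), tmp)

# colClasses stops read.csv from guessing types, and reading a string
# column as a factor avoids one character object per row.
dat <- read.csv(tmp,
                colClasses = c("integer", "numeric", "factor"))
str(dat)
unlink(tmp)
```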
> My hurdle seems nothing to do with efficiently calling scan(). I
> suspect the same is true for the originator of this memory experiment
> thread. It is the overhead of storing short strings, as Charles
> identified and Brian explained. I appreciate the investigation and
> clarification you both have made.
> 56B overhead for a 2-character string seems extreme to me, but I'm not
> complaining. I really like R, and being free, accept that
Well, there are only about 50000 2-char strings in an 8-bit locale, so
this does seem a case for using factors (as has been pointed out several
times).
And BTW, it is not 56B overhead, but 56B total for up to 7 chars.
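[The factor suggestion can be illustrated along these lines; exact sizes
vary by platform and R version, so no specific numbers are claimed:]

```r
# One million 2-char strings drawn from a small set of values.
x <- sample(c("ab", "cd", "ef", "gh"), 1e6, replace = TRUE)

# As a factor: one integer code per element plus a tiny level table,
# instead of one pointer (and possibly one string object) per element.
f <- factor(x)

print(object.size(x), units = "Mb")
print(object.size(f), units = "Mb")
```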
> In my case pre-processing is not an option, it is not a one off
> problem with a particular file. In my application, R is run in batch
> mode as part of a tool chain for arbitrary csv files. Having found
> cases where memory usage was as high as 20x file size, and allowing
> for a copy of the loaded dataset, I'll just need to document that
> it is possible that files as small as 1/40th of system memory may
> consume it all. That rules out some important datasets (US Census, UK
> Office of National Statistics files, etc) for 2GB servers.
> Regards, Mike
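[The back-of-envelope check Mike describes — up to roughly 20x the file
size in memory, doubled to allow for one working copy — can be sketched
as below. The helper name and the temp file standing in for a real
dataset are illustrative, not from the thread:]

```r
# Rough pre-flight estimate before loading an arbitrary CSV in batch
# mode. Assumptions (from the discussion above): up to ~20x expansion
# of the file size in memory, plus room for one copy of the data.
estimate_load_bytes <- function(path, expansion = 20, copies = 2) {
  file.info(path)$size * expansion * copies
}

# Example with a temporary file in place of a real dataset:
tmp <- tempfile()
writeLines(rep("a,b,c", 1000), tmp)
est <- estimate_load_bytes(tmp)
print(est)
unlink(tmp)
```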
> On 8/9/07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>> On Thu, 9 Aug 2007, Charles C. Berry wrote:
>>> On Thu, 9 Aug 2007, Michael Cassin wrote:
>>>> I really appreciate the advice and this database solution will be useful to
>>>> me for other problems, but in this case I need to address the specific
>>>> problem of scan and read.* using so much memory.
>>>> Is this expected behaviour?
>> Yes, and documented in the 'R Internals' manual. That is basic reading
>> for people wishing to comment on efficiency issues in R.
>>>> Can the memory usage be explained, and can it be
>>>> made more efficient? For what it's worth, I'd be glad to try to help if the
>>>> code for scan is considered to be worth reviewing.
>>> This does not seem to be an issue with scan() per se.
>>> Notice the difference in size of big2, big3, and bigThree here:
>>>> big2 <- rep(letters,length=1e6)
>>>  4.000856
>>>> big3 <- paste(big2,big2,sep='')
>>>  36.00002
>> On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
>> Character strings are R objects, but in some functions such as rep (and
>> scan for up to 10,000 distinct strings) the objects can be shared. More
>> string objects will be shared in 2.6.0 (but factors are designed to be
>> efficient at storing character vectors with few values).
>> On a 64-bit computer the overhead is usually double. So I would expect
>> just over 56 bytes/string for distinct short strings (and that is what
>> big3 gives).
>> But 56Mb is really not very much (tiny on a 64-bit computer), and 1
>> million items is a lot.
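[The quoted big2/big3 comparison is easy to re-run. Note the hedge: the
figures quoted above reflect pre-2.6.0 behaviour, where paste() allocated
a fresh string object per element; in R >= 2.6.0 the global string cache
shares identical strings, so big3 — which contains only the 26 distinct
values "aa".."zz" — may come out the same size as big2:]

```r
big2 <- rep(letters, length.out = 1e6)   # 26 shared string objects
big3 <- paste(big2, big2, sep = "")      # 26 distinct values "aa".."zz"

print(object.size(big2), units = "Mb")
print(object.size(big3), units = "Mb")
```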
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595