[R] large data set, error: cannot allocate vector
Robert Citek
rwcitek at alum.calberkeley.org
Tue May 9 22:27:58 CEST 2006
On May 9, 2006, at 1:32 PM, Jason Barnhart wrote:
> 1) So the original problem remains unsolved?
The question was answered, but the problem remains unsolved. The
question was: why am I getting a "cannot allocate vector" error when
reading in a list of 100 MM (100 million) integers? The answer appears to be:
1) R loads the entire data set into RAM
2) on a 32-bit system, a single R process maxes out at about 3 GB
3) loading 100 MM integer entries into a data.frame requires more
than 3 GB of RAM (5-10 GB, based on projections from 10 MM entries)
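One way to make that projection is to measure a smaller object with
object.size() and scale up. A minimal sketch (sizes are for
illustration only):

  ## object.size() reports the in-memory size of an R object
  x10 <- integer(10e6)        # 10 MM integers
  object.size(x10)            # roughly 40 MB (4 bytes per integer)
  ## a bare 100 MM integer vector would be about 10x that (~400 MB),
  ## but reading into a data.frame makes temporary copies along the
  ## way, which is why the observed peak is several GB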
So, the new question is, how does one work around such limits?
> You can load data but lack memory to do more (or so it appears). It
> seems to me that your options are:
> a) ensure that the --max-mem-size option is allowing R to
> utilize all available RAM
--max-mem-size doesn't exist in my version:
$ R --max-mem-size
WARNING: unknown option '--max-mem-size'
Do different versions of R on different OSes and different platforms
have different options?
FWIW, here's the usage statement from ?mem.limits:
R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max-ppsize=N
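From within R, the current limits and usage can also be inspected with
the functions documented there. A minimal sketch:

  mem.limits()   # nsize/vsize limits; NA means no explicit limit
  gc()           # current Ncells/Vcells usage and trigger levels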
> b) sample if possible, i.e. are 20MM necessary
Yes, or at least something within a factor of 4 of that.
> c) load in matrices or vectors, then "process" or analyze
Yes, I just need to learn more of the R language to do what I want.
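For example, reading the column straight into an integer vector with
scan() avoids the data.frame overhead entirely. A sketch ("ints.txt"
is just a placeholder file name):

  ## read a one-column file of integers into an atomic vector;
  ## what = integer(0) keeps scan() from using character intermediates
  x <- scan("ints.txt", what = integer(0), quiet = TRUE)
  summary(x)     # summaries work on the vector directly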
> d) load data in database that R connects to, use that engine for
> processing
I have a gut feeling something like this is the way to go.
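A sketch of that approach with SQLite, assuming the RSQLite package is
installed (the database, table, and column names are made up):

  ## keep the full data set in SQLite; pull only aggregates into R
  library(RSQLite)
  drv <- dbDriver("SQLite")
  con <- dbConnect(drv, dbname = "bigdata.db")
  res <- dbGetQuery(con,
      "SELECT grp, COUNT(*) AS n, AVG(value) AS mean_value
         FROM measurements
        GROUP BY grp")
  dbDisconnect(con)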
> e) drop unnecessary columns from data.frame
Yes. Currently, one of the fields is an identifier stored as a long
text string (30+ characters). That should probably be converted to an
integer to save both time and space.
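For instance, recoding the identifier as a factor stores each distinct
string only once and keeps the column itself as integer codes. A
sketch, assuming the data sit in a data.frame df with a column id:

  df$id <- factor(df$id)       # distinct strings stored once as levels
  ## if the original strings aren't needed at all, keep just the codes:
  df$id <- as.integer(df$id)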
> f) analyze subsets of the data (variable-wise--review fewer vars
> at a time)
Possibly.
> g) buy more RAM (32 vs 64 bit architecture should not be the
> issue, since you use LINUX)
The 32-bit address space seems to be the limit. We've got 6 GB of RAM
and 8 GB of swap, but R chokes well before those limits are reached.
> h) ???
Yes, possibly some other solution we haven't considered.
> 2) Not finding memory.limit() is very odd. You should consider
> reviewing the bug reporting process to determine if this should be
> reported. Here's an example of my output.
> > memory.limit()
> [1] 1782579200
Do different versions of R on different OSes and different platforms
have different functions?
> 3) This may not be the correct way to look at the timing
> differences you experienced. However, it seems R is holding up well.
>
>                     10MM    100MM   ratio-100MM/10MM
> cat                 0.04     7.60             190.00
> scan                9.93    92.27               9.29
> ratio scan/cat    248.25    12.14
I re-ran the timing test for the 100 MM file, taking caching into
account. Linux with 6 GB of RAM has no problem caching the 100 MM
file (600 MB):
                    10MM    100MM   ratio-100MM/10MM
cat                 0.04     0.38               9.50
scan                9.93    92.27               9.29
ratio scan/cat    248.25   242.82
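For reference, the scan() timings come from system.time() inside R;
the cat timings were taken at the shell. A sketch (the file name is a
placeholder):

  ## elapsed time is the third element of the result
  system.time(x <- scan("ints-100MM.txt", what = integer(0), quiet = TRUE))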
> Please let me know how you resolve. I'm curious about your solution
> HTH,
Indeed, very helpful. I'm learning more about R every day. Thanks
for your feedback.
Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software. Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent