[R] large data set, error: cannot allocate vector

Robert Citek rwcitek at alum.calberkeley.org
Tue May 9 22:27:58 CEST 2006


On May 9, 2006, at 1:32 PM, Jason Barnhart wrote:

> 1) So the original problem remains unsolved?

The question was answered but the problem remains unsolved.  The  
question was, why am I getting an error "cannot allocate vector" when  
reading in a 100 MM integer list.  The answer appears to be:

1) R loads the entire data set into RAM
2) on a 32-bit system R maxes out at about 3 GB of address space
3) loading 100 MM integer entries into a data.frame requires more  
than 3 GB of RAM (5-10 GB based on projections from 10 MM entries)
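
For a sense of scale, the arithmetic behind point 3 looks roughly
like this (a sketch; per-element sizes are approximate):

   n <- 100e6            # 100 MM entries
   4 * n / 2^20          # ~381 MB for one bare integer vector

The rest of the footprint comes from the data.frame's other columns
(especially character ones) and from the intermediate copies that
scan() and read.table() make while parsing, which multiply that base
figure several times over.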

So, the new question is, how does one work around such limits?

> You can load data but lack memory to do more (or so it appears). It  
> seems to me that your options are:
>    a) ensure that the --max-mem-size option is allowing R to  
> utilize all available RAM

--max-mem-size doesn't exist in my version:

$ R --max-mem-size
WARNING: unknown option '--max-mem-size'

Do different versions of R on different OSes and different platforms  
have different options?

FWIW, here's the usage statement from ?mem.limits:

R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max-ppsize=N
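
The run-time counterpart to those flags, from the same help page, is
mem.limits():

   mem.limits()   # reports the current nsize/vsize limits; NA = no limit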

>    b) sample if possible, i.e. are 20MM necessary

Yes, or within a factor of 4 of that.
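
A minimal sketch of streaming a ~10% sample in 1 MM-entry chunks, so
the full vector never has to exist in RAM (the file name is a
placeholder, and growing keep with c() trades speed for simplicity):

   con <- file("data.txt", open = "r")
   keep <- integer(0)
   repeat {
     chunk <- scan(con, what = integer(0), n = 1e6, quiet = TRUE)
     if (length(chunk) == 0) break
     keep <- c(keep, chunk[runif(length(chunk)) < 0.10])
   }
   close(con)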

>    c) load in matrices or vectors, then "process" or analyze

Yes, I just need to learn more of the R language to do what I want.
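
For instance, scan() can read straight into an atomic vector, which
is far leaner than a data.frame (assuming one integer per line; the
file name is a placeholder):

   x <- scan("data.txt", what = integer(0))
   summary(x)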

>    d) load data in database that R connects to, use that engine for  
> processing

I have a gut feeling something like this is the way to go.
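
A sketch of what that could look like with the RSQLite package,
assuming the data were bulk-loaded beforehand with the sqlite3
command line (database, table, and column names are hypothetical):

   library(RSQLite)
   con <- dbConnect(dbDriver("SQLite"), dbname = "big.db")
   dbGetQuery(con, "select count(*), avg(value) from entries")
   dbDisconnect(con)

The aggregation happens inside the database engine; only a one-row
data.frame ever comes back to R.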

>    e) drop unnecessary columns from data.frame

Yes.  Currently, one of the fields is an identifier stored as a long
text string (30+ chars).  It should probably be converted to an
integer code to conserve both time and space.
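
read.table()'s colClasses argument can handle both at read time:
"NULL" skips a column entirely, and "factor" stores each distinct
identifier string once plus an integer code per row (the column order
and types here are assumptions):

   df <- read.table("data.txt", header = TRUE,
                    colClasses = c("factor", "NULL", "integer"))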

>    f) analyze subsets of the data (variable-wise--review fewer vars  
> at a time)

Possibly.

>    g) buy more RAM (32 vs 64 bit architecture should not be the  
> issue, since you use LINUX)

The 32-bit address space seems to be the limit.  We've got 6 GB of
RAM and 8 GB of swap, yet R chokes well before those limits are
reached: a single 32-bit process can address only about 3 GB no
matter how much physical memory the machine has.

>    h) ???

Yes, possibly some other solution we haven't considered.

> 2) Not finding memory.limit() is very odd.  You should consider  
> reviewing the bug reporting process to determine if this should be  
> reported.  Here's an example of my output.
>    > memory.limit()
>    [1] 1782579200

Do different versions of R on different OSes and different platforms  
have different functions?

> 3) This may not be the correct way to look at the timing  
> differences you experienced. However, it seems R is holding up well.
>
>                    10MM  100MM  ratio-100MM/10MM
>           cat      0.04   7.60  190.00
>          scan      9.93  92.27    9.29
> ratio scan/cat    248.25  12.14

I re-ran the timing test for the 100 MM file, this time taking
filesystem caching into account.  Linux with 6 GB of RAM has no
trouble caching the 100 MM file (600 MB); times below are in seconds:

                     10MM    100MM  ratio-100MM/10MM
           cat       0.04     0.38    9.50
          scan       9.93    92.27    9.29
ratio scan/cat    248.25   242.82
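
For reference, the scan figures come from something like the call
below (the file name is a placeholder); the cat row was timed in the
shell with "time cat file > /dev/null":

   system.time(x <- scan("data-100MM.txt", what = integer(0)))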

> Please let me know how you resolve it.  I'm curious about your
> solution.
> HTH,

Indeed, very helpful.  I'm learning more about R every day.  Thanks  
for your feedback.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent



