[R] large data set, error: cannot allocate vector

Tue May 9 23:32:45 CEST 2006

Robert,

Thanks, I stand corrected on the RAM issue re: 32 vs. 64 bit builds.

As for the --max-memory-size option, I'll try to check my LINUX version at 
home tonight.

-jason

----- Original Message ----- 
From: "Robert Citek" <rwcitek at alum.calberkeley.org>
To: <r-help at stat.math.ethz.ch>
Cc: "Jason Barnhart" <jasoncbarnhart at msn.com>
Sent: Tuesday, May 09, 2006 1:27 PM
Subject: Re: [R] large data set, error: cannot allocate vector

>
> On May 9, 2006, at 1:32 PM, Jason Barnhart wrote:
>
>> 1) So the original problem remains unsolved?
>
> The question was answered but the problem remains unsolved.  The  question 
> was, why am I getting an error "cannot allocate vector" when  reading in a 
> 100 MM integer list.  The answer appears to be:
>
> 1) R loads the entire data set into RAM
> 2) on a 32-bit system R max'es out at 3 GB
> 3) loading 100 MM integer entries into a data.frame requires more  than 3 
> GB of RAM (5-10 GB based on projections from 10 MM entries)
>
> So, the new question is, how does one work around such limits?
>
>> You can load data but lack memory to do more (or so it appears). It 
>> seems to me that your options are:
>>    a) ensure that the --max-mem-size option is allowing R to  utilize all 
>> available RAM
>
> --max-mem-size doesn't exist in my version:
>
> $ R --max-mem-size
> WARNING: unknown option '--max-mem-size'
>
> Do different versions of R on different OSes and different platforms  have 
> different options?
>
> FWIW, here's the usage statement from ?mem.limits:
>
> R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max- 
> ppsize=N
>
>>    b) sample if possible, i.e. are 20MM necessary
>
> Yes, or within a factor of 4 of that.
>
>>    c) load in matrices or vectors, then "process" or analyze
>
> Yes, I just need to learn more of the R language to do what I want.
>
>>    d) load data in database that R connects to, use that engine for 
>> processing
>
> I have a gut feeling something like this is the way to go.
>
>>    e) drop unnecessary columns from data.frame
>
> Yes.  Currently, one of the fields is an identifier field which is a  long 
> text field (30+ chars).  That should probably be converted to an  integer 
> to conserve on both time and space.
>
>>    f) analyze subsets of the data (variable-wise--review fewer vars  at a 
>> time)
>
> Possibly.
>
>>    g) buy more RAM (32 vs 64 bit architecture should not be the  issue, 
>> since you use LINUX)
>
> 32-bit seems to be the limit.  We've got 6 GB of RAM and 8 GB of  swap. 
> Despite that R chokes well before those limits are reached.
>
>>    h) ???
>
> Yes, possibly some other solution we haven't considered.
>
>> 2) Not finding memory.limit() is very odd.  You should consider 
>> reviewing the bug reporting process to determine if this should be 
>> reported.  Here's an example of my output.
>>    > memory.limit()
>>    [1] 1782579200
>
> Do different versions of R on different OSes and different platforms  have 
> different functions?
>
>> 3) This may not be the correct way to look at the timing  differences you 
>> experienced. However, it seems R is holding up well.
>>
>>                    10MM  100MM  ratio-100MM/10MM
>>           cat      0.04   7.60  190.00
>>          scan      9.93  92.27    9.29
>> ratio scan/cat    248.25  12.14
>
> I re-ran the timing test for the 100 MM file taking caching into  account. 
> Linux with 6 GB has no problem caching the 100 MM file (600  MB):
>
>                     10MM    100MM  ratio-100MM/10MM
>           cat       0.04     0.38    9.50
>          scan       9.93    92.27    9.29
> ratio scan/cat    248.25   242.82
>
>> Please let me know how you resolve.  I'm curious about your solution
>> HTH,
>
> Indeed, very helpful.  I'm learning more about R every day.  Thanks  for 
> your feedback.
>
> Regards,
> - Robert
> http://www.cwelug.org/downloads
> Help others get OpenSource software.  Distribute FLOSS
> for Windows, Linux, *BSD, and MacOS X with BitTorrent
>
>