[R] large data set, error: cannot allocate vector

Jason Barnhart jasoncbarnhart at msn.com
Tue May 9 20:32:30 CEST 2006


1) So the original problem remains unsolved?  You can load data but lack 
memory to do more (or so it appears). It seems to me that your options are:
    a) ensure that the --max-mem-size option is allowing R to utilize all 
available RAM
    b) sample if possible, i.e. are all 20MM rows really necessary? (see 
the sketch after this list)
    c) load the data into matrices or vectors instead of a data.frame, 
then "process" or analyze
    d) load the data into a database that R connects to, and use that 
engine for the processing
    e) drop unnecessary columns from the data.frame
    f) analyze subsets of the data (variable-wise: review fewer variables 
at a time)
    g) buy more RAM (32- vs. 64-bit architecture should not be the issue, 
since you use Linux)
    h) ???
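
For options b) and e), here is a minimal sketch; the file name and column 
layout are hypothetical, so adjust colClasses to match your real data:

    ## "NULL" in colClasses drops a column at read time, so it never
    ## occupies memory (option e).
    df <- read.delim("dataset.txt",
                     colClasses = c("numeric", "NULL", "character"))

    ## Option b: work on a random sample rather than all ~20MM rows
    ## (assumes the file has at least 1e6 rows).
    idx <- sample(nrow(df), 1e6)
    sub <- df[idx, ]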

2) Not finding memory.limit() is very odd.  You should consider reviewing 
the bug reporting process to determine whether this should be reported. 
Here's an example of my output:
    > memory.limit()
    [1] 1782579200
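
Whatever the limit turns out to be, object.size() will tell you how much 
memory a single object actually consumes, so you can see how close the 
loaded data comes to it.  A sketch, reusing your toy file:

    x <- scan("dataset.010MM.txt")   # 10MM doubles from the toy data set
    object.size(x)                   # ~8 bytes per double, roughly 80 MB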

3) This may not be the correct way to look at the timing differences you 
experienced, but R seems to be holding up well.  The cat and scan rows 
below are in seconds:

                      10MM   100MM   ratio 100MM/10MM
    cat               0.04    7.60             190.00
    scan              9.93   92.27               9.29
    ratio scan/cat  248.25   12.14
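
If you'd rather take timings from inside R than from the shell, 
system.time() reports user/system/elapsed seconds; a sketch with the 
same toy file:

    st <- system.time(foov <- scan("dataset.010MM.txt"))
    st   # user, system, and elapsed times, in seconds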

Please let me know how you resolve this; I'm curious about your solution.
HTH,
-jason


----- Original Message ----- 
From: "Robert Citek" <rwcitek at alum.calberkeley.org>
To: <r-help at stat.math.ethz.ch>
Cc: "Jason Barnhart" <jasoncbarnhart at msn.com>
Sent: Tuesday, May 09, 2006 9:22 AM
Subject: Re: [R] large data set, error: cannot allocate vector


>
> On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
>> Please try memory.limit() to confirm how much system memory is 
>> available to R.
>
> Unfortunately, memory.limit() is not available:
>
> R > memory.limit()
> Error: could not find function "memory.limit"
>
> Did you mean mem.limits()?
>
> R > mem.limits()
> nsize vsize
>    NA    NA
>
>> Additionally, read.delim returns a data.frame.  You could use the 
>> colClasses argument to change variable types (see example below) or 
>> use scan(), which returns a vector.  This would store the data more 
>> compactly; the vector object is significantly smaller than the 
>> data.frame.
>>
>> It appears from your example session that you are examining a single 
>> variable.  If so, a vector would suffice.
>
> Yes, a vector worked very nicely (see below).  In fact, using the 
> vector method, R was able to read in the 10MM-entry data set much 
> faster than into a data.frame.
>
> The reason I have stayed with data.frames is that my "real" data is of 
> a mixed type, much like a database table or spreadsheet.  Unfortunately, 
> my real data set takes too long to work with (~20MM entries of mixed 
> type, which requires over 20 minutes just to load into R).  In contrast, 
> the toy data set has about the same number of entries but only a single 
> column, which captures some of the essence of my real data set while 
> being much faster and easier to work with.
>
>> Note that in the example below, processing large numbers as the 
>> integer type creates an under/overflow error.
>
> Thanks for the examples.  They really help.
>
> Here's a sample transcript from a bash shell under Linux comparing 
> some timings using a vector within R:
>
> $ uname -sorv ; rpm -q R ; R --version
> Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
> R-2.3.0-2.fc4
> R version 2.3.0 (2006-04-24)
> Copyright (C) 2006 R Development Core Team
>
> $ time -p cat dataset.010MM.txt > /dev/null
> real 0.04
> user 0.00
> sys 0.03
>
> $ time -p cat dataset.100MM.txt > /dev/null
> real 7.60
> user 0.06
> sys 0.67
>
> $ time -p wc -l dataset.100MM.txt
> 100000000 dataset.100MM.txt
> real 2.38
> user 1.92
> sys 0.44
>
> $ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
>   | time -p R -q --no-save
>
> R > foov <- scan("dataset.010MM.txt") ; length(foov)
> Read 10000000 items
> [1] 10000000
>
> real 9.93
> user 9.41
> sys 0.52
>
> $ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
>   | time -p R -q --no-save
>
> R > foov <- scan("dataset.100MM.txt") ; length(foov)
> Read 100000000 items
> [1] 100000000
>
> real 92.27
> user 88.66
> sys 3.58
>
> Regards,
> - Robert
> http://www.cwelug.org/downloads
> Help others get OpenSource software.  Distribute FLOSS
> for Windows, Linux, *BSD, and MacOS X with BitTorrent