[R] large data set, error: cannot allocate vector

Robert Citek rwcitek at alum.calberkeley.org
Tue May 9 18:22:58 CEST 2006


On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
> Please try memory.limit() to confirm how much system memory is  
> available to R.

Unfortunately, memory.limit() is not available:

R > memory.limit()
Error: could not find function "memory.limit"

Did you mean mem.limits()?

R > mem.limits()
nsize vsize
    NA    NA
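
As far as I know, memory.limit() was specific to the Windows builds of R, which would explain the error on Linux. A rough, portable way to see what the session is using is gc() together with object.size(). The following is only a sketch; the vector x is a made-up example object, not part of the data set discussed here.

```r
# Hypothetical example object: one million doubles (roughly 8 MB).
x <- numeric(1e6)

# Size of a single object, printed in megabytes.
print(object.size(x), units = "Mb")

# gc() triggers a garbage collection and returns a matrix of memory use;
# the rows are Ncells (fixed-size cells) and Vcells (vector heap).
mem <- gc()
print(mem[, 1:2])
```

The numbers reported depend on the platform and on what else is loaded in the session.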

> Additionally, read.delim returns a data.frame.  You could use the
> colClasses argument to change variable types (see example below) or
> use scan(), which returns a vector.  This would store the data more
> compactly.  The vector object is significantly smaller than the
> data.frame.
>
> It appears from your example session that you are examining a single
> variable.  If so, a vector would suffice.

Yes, a vector worked very nicely (see below).  In fact, using the
vector method, R was able to read in the 10 MM entry data set much
faster than with a data.frame.
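
A minimal sketch of the two approaches side by side, assuming a small stand-in file with one numeric value per line. The temporary file and the names food and foov are illustrative only, and the size gap between the two objects will depend on the R version.

```r
# Hypothetical toy file: one numeric value per line (not the original data set).
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(seq_len(100000)), tmp)

# read.delim returns a data.frame; colClasses fixes the column type up front
# so R does not have to guess it while reading.
food <- read.delim(tmp, header = FALSE, colClasses = "numeric")

# scan returns a plain vector; what = double() is the default and is
# spelled out here only for clarity.
foov <- scan(tmp, what = double(), quiet = TRUE)

# Compare the in-memory footprint of the two objects.
print(object.size(food))
print(object.size(foov))

unlink(tmp)
```

For mixed-type data, colClasses also accepts one class per column, e.g. c("integer", "character", "numeric").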

The reason I have stayed with data.frames is that my "real" data is
of mixed type, much like a database table or spreadsheet.
Unfortunately, my real data set takes too long to work with: ~20 MM
entries of mixed type, which require over 20 minutes just to load
into R.  In contrast, the toy data set has about the same number of
entries but only a single column; it captures some of the essence of
my real data set while being a lot faster and easier to work with.

> Note in the example below, processing large numbers in the integer
> type creates an under/over flow error.

Thanks for the examples.  They really help.
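
For reference, the integer overflow mentioned in the quoted text can be reproduced with a few lines. This is a sketch of the general behavior, not the example from the earlier message; the variable names are made up.

```r
# Largest value R's integer type can hold.
x <- .Machine$integer.max

# Integer arithmetic past that limit yields NA (R also issues a warning,
# suppressed here to keep the output short).
y <- suppressWarnings(x + 1L)

# Converting to double first avoids the overflow.
z <- as.numeric(x) + 1

print(is.na(y))
print(z)
```

This matters when choosing what = integer() versus the default what = double() in scan(): sums and products of large integer columns can overflow where doubles would not.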

Here's a sample transcript from a bash shell under Linux comparing  
some timings using a vector within R:

$ uname -sorv ; rpm -q R ; R --version
Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
R-2.3.0-2.fc4
R version 2.3.0 (2006-04-24)
Copyright (C) 2006 R Development Core Team

$ time -p cat dataset.010MM.txt > /dev/null
real 0.04
user 0.00
sys 0.03

$ time -p cat dataset.100MM.txt > /dev/null
real 7.60
user 0.06
sys 0.67

$ time -p wc -l dataset.100MM.txt
100000000 dataset.100MM.txt
real 2.38
user 1.92
sys 0.44

$ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
   | time -p R -q --no-save

R > foov <- scan("dataset.010MM.txt") ; length(foov)
Read 10000000 items
[1] 10000000

real 9.93
user 9.41
sys 0.52

$ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
   | time -p R -q --no-save

R > foov <- scan("dataset.100MM.txt") ; length(foov)
Read 100000000 items
[1] 100000000

real 92.27
user 88.66
sys 3.58
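
The same kind of measurement can be taken from inside R with system.time(), which avoids counting R's startup time. A minimal sketch, assuming a small temporary file as a stand-in for the data sets above:

```r
# Hypothetical toy file standing in for dataset.010MM.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(seq_len(100000)), tmp)

# system.time() reports user, system and elapsed seconds for the expression.
timing <- system.time(foov <- scan(tmp, quiet = TRUE))
print(timing)

# Elapsed wall-clock time alone, comparable to "real" from time -p.
print(timing[["elapsed"]])

unlink(tmp)
```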

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent



