[R] large data set, error: cannot allocate vector
Jason Barnhart
jasoncbarnhart at msn.com
Tue May 9 20:32:30 CEST 2006
1) So the original problem remains unsolved? You can load data but lack
memory to do more (or so it appears). It seems to me that your options are:
a) ensure that the --max-mem-size option is allowing R to utilize all
available RAM
b) sample if possible, i.e. are all 20MM entries necessary?
c) load the data into matrices or vectors, then "process" or analyze
d) load the data into a database that R connects to, and use that engine
for processing
e) drop unnecessary columns from the data.frame
f) analyze subsets of the data (variable-wise, i.e. review fewer vars at a
time)
g) buy more RAM (32 vs 64 bit architecture should not be the issue,
since you use Linux)
h) ???
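A rough sketch of options (b), (c) and (e), assuming a tab-delimited file with a header row. The file and its three columns (id, label, value) are made up for illustration and are not your real data; adjust colClasses and the field list to match your layout.

```r
# Hypothetical stand-in for the real file: 1000 rows, three mixed-type columns
tmp <- tempfile(fileext = ".txt")
set.seed(1)
toy <- data.frame(id = 1:1000,
                  label = sample(letters, 1000, replace = TRUE),
                  value = rnorm(1000))
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# (e) drop unneeded columns at read time: "NULL" in colClasses skips a column
slim <- read.delim(tmp, colClasses = c("integer", "NULL", "numeric"))

# (c) read one column as a plain vector with scan();
#     NULL entries in 'what' skip fields 1 and 2, 0 reads field 3 as numeric
value <- scan(tmp, what = list(NULL, NULL, 0), sep = "\t", skip = 1,
              quiet = TRUE)[[3]]

# (b) work on a random sample of rows instead of the full set
idx <- sample(nrow(slim), 100)
sub <- slim[idx, ]
```

Declaring colClasses up front also spares read.delim from guessing each column's type, which tends to help load time on large files.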
2) Not finding memory.limit() is very odd. You should consider reviewing
the bug reporting process to determine if this should be reported. Here's
an example of my output.
> memory.limit()
[1] 1782579200
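If memory.limit() stays unavailable on your build, gc() and object.size() ship with every R installation as far as I know, and they give a rough picture of what a session is using. A minimal sketch; the object names are only for illustration:

```r
x  <- numeric(1e6)        # one million doubles, roughly 8 MB
df <- data.frame(x = x)   # the same data wrapped in a data.frame

print(object.size(x))     # bytes used by the bare vector
print(object.size(df))    # bytes used by the data.frame (a little more)

g <- gc()                 # force a garbage collection; returns a usage matrix
print(g)                  # rows Ncells and Vcells show current and max usage
```

This does not report a limit, only current usage, so it complements rather than replaces a check of how much RAM the machine has.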
3) This may not be the correct way to look at the timing differences you
experienced. However, it seems R is holding up well.
                  10MM    100MM   ratio 100MM/10MM
cat               0.04     7.60             190.00
scan              9.93    92.27               9.29
ratio scan/cat  248.25    12.14
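For reference, the ratios above can be reproduced in R from the "real" times in your transcript. This is just the arithmetic, with variable names of my own choosing:

```r
# Wall-clock ("real") seconds copied from the quoted transcript
real <- matrix(c(0.04,  7.60,
                 9.93, 92.27),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("cat", "scan"), c("10MM", "100MM")))

# How much longer each tool takes when the file grows tenfold
growth <- real[, "100MM"] / real[, "10MM"]

# How much slower scan() is than cat at each file size
slowdown <- real["scan", ] / real["cat", ]

print(round(growth, 2))
print(round(slowdown, 2))
```

The scan() time grows by a factor close to 10 for a file 10 times larger, i.e. roughly linearly, which is why I say R is holding up well.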
Please let me know how you resolve this. I'm curious about your solution.
HTH,
-jason
----- Original Message -----
From: "Robert Citek" <rwcitek at alum.calberkeley.org>
To: <r-help at stat.math.ethz.ch>
Cc: "Jason Barnhart" <jasoncbarnhart at msn.com>
Sent: Tuesday, May 09, 2006 9:22 AM
Subject: Re: [R] large data set, error: cannot allocate vector
>
> On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
>> Please try memory.limit() to confirm how much system memory is available
>> to R.
>
> Unfortunately, memory.limit() is not available:
>
> R > memory.limit()
> Error: could not find function "memory.limit"
>
> Did you mean mem.limits()?
>
> R > mem.limits()
> nsize vsize
>    NA    NA
>
>> Additionally, read.delim returns a data.frame. You could use the
>> colClasses argument to change variable types (see example below) or use
>> scan() which returns a vector. This would store the data more compactly.
>> The vector object is significantly smaller than the data.frame.
>>
>> It appears from your example session that you are examining a single
>> variable. If so, a vector would suffice.
>
> Yes, a vector worked very nicely (see below.) In fact, using the vector
> method R was able to read in the 10 MM entry data set much faster than a
> data.frame.
>
> The reason I have stayed with data.frames is because my "real" data is of
> a mixed type, much like a database table or spreadsheet. Unfortunately,
> my real data set takes too long to work with (~20 MM entries of mixed
> type which requires over 20 minutes just to load the data into R.) In
> contrast, the toy data set is about the same number of entries, but only
> a single column, which captures some of the essence of my real data set
> but is a lot faster and easier to work with.
>
>> Note in the example below, processing large numbers in the integer type
>> creates an under/over flow error.
>
> Thanks for the examples. They really help.
>
> Here's a sample transcript from a bash shell under Linux comparing some
> timings using a vector within R:
>
> $ uname -sorv ; rpm -q R ; R --version
> Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
> R-2.3.0-2.fc4
> R version 2.3.0 (2006-04-24)
> Copyright (C) 2006 R Development Core Team
>
> $ time -p cat dataset.010MM.txt > /dev/null
> real 0.04
> user 0.00
> sys 0.03
>
> $ time -p cat dataset.100MM.txt > /dev/null
> real 7.60
> user 0.06
> sys 0.67
>
> $ time -p wc -l dataset.100MM.txt
> 100000000 dataset.100MM.txt
> real 2.38
> user 1.92
> sys 0.44
>
> $ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
> | time -p R -q --no-save
>
> R > foov <- scan("dataset.010MM.txt") ; length(foov)
> Read 10000000 items
> [1] 10000000
>
> real 9.93
> user 9.41
> sys 0.52
>
> $ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
> | time -p R -q --no-save
>
> R > foov <- scan("dataset.100MM.txt") ; length(foov)
> Read 100000000 items
> [1] 100000000
>
> real 92.27
> user 88.66
> sys 3.58
>
> Regards,
> - Robert
> http://www.cwelug.org/downloads
> Help others get OpenSource software. Distribute FLOSS
> for Windows, Linux, *BSD, and MacOS X with BitTorrent
>
>