[R] large data set, error: cannot allocate vector
Robert Citek
rwcitek at alum.calberkeley.org
Tue May 9 18:22:58 CEST 2006
On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
> Please try memory.limit() to confirm how much system memory is
> available to R.
Unfortunately, memory.limit() is not available:
R > memory.limit()
Error: could not find function "memory.limit"
Did you mean mem.limits()?
R > mem.limits()
nsize vsize
NA NA
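As far as I can tell, memory.limit() is specific to the Windows build
of R, which would explain the error on Linux. A minimal sketch of how
memory use could be inspected instead, using gc() and object.size();
the vector x is only an illustration and is not part of my data:

```r
# Illustrative vector (an assumption, not the real data set):
# one million doubles at 8 bytes each.
x <- numeric(1e6)

# Approximate memory used by this single object, in bytes.
bytes <- as.numeric(object.size(x))
print(bytes)

# gc() triggers a garbage collection and returns a matrix describing
# the memory R currently uses, with rows "Ncells" and "Vcells".
mem <- gc()
print(mem)
```

This only reports what R is using; it does not say how much the
operating system would be willing to hand out.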
> Additionally, read.delim returns a data.frame. You could use the
> colClasses argument to change variable types (see example below) or
> use scan() which returns a vector. This would store the data more
> compactly. The vector object is significantly smaller than the
> data.frame.
>
> It appears from your example session that you are examining a single
> variable. If so, a vector would suffice.
Yes, a vector worked very nicely (see below). In fact, with the
vector method R read the 10 MM-entry data set much faster than it
read the same data into a data.frame.
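To make the two approaches concrete, here is a small sketch that
reads the same single-column file both ways. The temporary file is a
stand-in for dataset.010MM.txt, and the 1000-row size is an
assumption chosen only so that it runs quickly:

```r
# Write a small single-column numeric file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(seq_len(1000)), tmp)

# scan() returns a plain numeric vector.
foov <- scan(tmp, quiet = TRUE)

# read.delim() returns a data.frame; colClasses fixes the column type
# up front so R does not have to guess it from the data.
food <- read.delim(tmp, header = FALSE, colClasses = "numeric")

print(length(foov))
print(dim(food))

# Compare the in-memory size of the two objects.
print(object.size(foov))
print(object.size(food))

unlink(tmp)
```

How large the size and speed differences are will depend on the R
version and the data, so it is worth measuring on the real file.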
The reason I have stayed with data.frames is that my "real" data is
of mixed type, much like a database table or spreadsheet.
Unfortunately, my real data set takes too long to work with: ~20 MM
entries of mixed type, which require over 20 minutes just to load
into R. In contrast, the toy data set has about the same number of
entries but only a single column, so it captures some of the essence
of my real data set while being a lot faster and easier to work with.
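For mixed-type data, one option might be to declare the column types
up front. The sketch below uses a made-up three-column, tab-separated
file (the column names id, name and value are hypothetical) and reads
it once with read.delim() plus colClasses and nrows, and once with
scan() and a list passed as `what`. I have not measured whether this
shortens the 20-minute load of the real data:

```r
# Hypothetical mixed-type file: integer, character and numeric columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue",
             "1\talpha\t0.5",
             "2\tbeta\t1.5",
             "3\tgamma\t2.5"), tmp)

# Declare the type of every column and the number of rows to read.
mixed <- read.delim(tmp,
                    colClasses = c("integer", "character", "numeric"),
                    nrows = 3)
str(mixed)

# scan() can also read mixed types when `what` is a list; it returns
# a list with one vector per column rather than a data.frame.
cols <- scan(tmp,
             what = list(id = integer(), name = character(),
                         value = numeric()),
             sep = "\t", skip = 1, quiet = TRUE)
str(cols)

unlink(tmp)
```

The list returned by scan() keeps each column as a plain vector, which
may be enough when only a few columns are examined at a time.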
> Note in the example below, processing large numbers in the integer
> type creates an under/over flow error.
Thanks for the examples. They really help.
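For the record, the overflow can be reproduced in a couple of lines.
This is my own minimal sketch, not Jason's example:

```r
# Largest value an R integer can hold (2147483647 on current builds).
big <- .Machine$integer.max

# Integer arithmetic past that limit yields NA with a warning;
# the warning is suppressed here so the script runs quietly.
over <- suppressWarnings(big + 1L)
print(over)

# Converting to double before the arithmetic avoids the overflow.
ok <- as.numeric(big) + 1
print(ok)
```

The same applies to sums over long integer vectors, so converting
with as.numeric() first is the safer route for large totals.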
Here's a sample transcript from a bash shell under Linux comparing
some timings using a vector within R:
$ uname -sorv ; rpm -q R ; R --version
Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
R-2.3.0-2.fc4
R version 2.3.0 (2006-04-24)
Copyright (C) 2006 R Development Core Team
$ time -p cat dataset.010MM.txt > /dev/null
real 0.04
user 0.00
sys 0.03
$ time -p cat dataset.100MM.txt > /dev/null
real 7.60
user 0.06
sys 0.67
$ time -p wc -l dataset.100MM.txt
100000000 dataset.100MM.txt
real 2.38
user 1.92
sys 0.44
$ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
| time -p R -q --no-save
R > foov <- scan("dataset.010MM.txt") ; length(foov)
Read 10000000 items
[1] 10000000
real 9.93
user 9.41
sys 0.52
$ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
| time -p R -q --no-save
R > foov <- scan("dataset.100MM.txt") ; length(foov)
Read 100000000 items
[1] 100000000
real 92.27
user 88.66
sys 3.58
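Working from the "real" times above, scan() reads roughly one million
items per second in both runs, so load time grows about linearly with
file size. The arithmetic, plus a sketch of timing scan() from inside
R with system.time(), is below; the temporary file and its
10000-item size are assumptions for illustration:

```r
# Items read and elapsed ("real") seconds, taken from the transcript.
items   <- c(1e7, 1e8)
elapsed <- c(9.93, 92.27)

# Throughput in items per second for each run.
rate <- items / elapsed
print(round(rate))

# Ratio of elapsed times for a 10x larger file.
print(elapsed[2] / elapsed[1])

# Timing a read from inside R instead of with the shell's time.
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(seq_len(10000)), tmp)
timing <- system.time(foov <- scan(tmp, quiet = TRUE))
print(timing["elapsed"])
unlink(tmp)
```

Timing inside R excludes R's own startup, which the shell-level
numbers above include.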
Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software. Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent