[R] naive question
Richard A. O'Keefe
ok at cs.otago.ac.nz
Fri Jul 2 02:22:21 CEST 2004
As part of a continuing thread on the cost of loading large
amounts of data into R,
"Vadim Ogranovich" <vograno at evafunds.com> wrote:
    R's IO is indeed 20 - 50 times slower than that of equivalent C code
    no matter what you do, which has been a pain for some of us.
I wondered to myself just how bad R is at reading when it is given
a fair chance, so I performed an experiment.
My machine (according to "Workstation Info") is a SunBlade 100 with 640MB
of physical memory running SunOS 5.9 Generic; according to fpversion this
is an Ultra2e with the CPU clock running at 500MHz and the main memory
clock running at 84MHz (wow, slow memory).  R.version is
platform sparc-sun-solaris2.9
arch     sparc
os       solaris2.9
system   sparc, solaris2.9
status
major    1
minor    9.0
year     2004
month    04
day      12
language R
and although this is a 64-bit machine, it's a 32-bit installation of R.
The experiment was this:
(1) I wrote a C program that generated 12500 rows of 800 columns, the
numbers were integers 0..999,999,999 generated using drand48().
These numbers were written using printf(). It is possible to do
quite a bit better by avoiding printf(), but that would ruin the
spirit of the comparison, which is to see what can be done with
*straightforward* code using *existing* library functions.
21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.
The sizes were chosen to get 100MB; the actual size was
12500 (lines) 10000000 (words) 100012500 (bytes)
(2) I wrote a C program that read these numbers using scanf("%d"); it
"knew" there were 800 numbers per row and 12500 numbers in all.
Again, it is possible to do better by avoiding scanf(), but the
point is to look at *straightforward* code.  (A sketch of both
programs appears below.)
18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.
(3) I started R, played around a bit doing other things, then issued this
command:
> system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
+ row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
+ comment.char=""))
So how long _did_ it take to read 100MB on this machine?
71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
The result: the R/C ratio was less than 4, whether you measure cpu time
or real time. It certainly wasn't anywhere near 20-50 times slower.
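For concreteness, here is a minimal sketch of the sort of C code I mean.
The file name, the error handling, and the final count are illustrative
rather than the exact program I ran, and the fixed-width "%9d " format is
only a guess (it happens to reproduce the 100012500-byte size quoted
above); the point is that both halves use nothing beyond the standard
stdio/stdlib routines.

/* Sketch of steps (1) and (2): straightforward text I/O using only
   standard library calls.  File name, format string, and error
   handling are illustrative, not necessarily what was actually run. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 12500
#define COLS   800

/* (1) write ROWS x COLS integers in 0..999,999,999 with fprintf() */
static void generate(const char *name)
{
    FILE *f = fopen(name, "w");
    long i, j;

    if (f == NULL) { perror(name); exit(EXIT_FAILURE); }
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            fprintf(f, "%9d ", (int)(drand48() * 1e9));
        putc('\n', f);
    }
    fclose(f);
}

/* (2) read them back with fscanf("%d"), "knowing" the dimensions */
static void read_back(const char *name)
{
    FILE *f = fopen(name, "r");
    long i, j, n = 0;
    int x;

    if (f == NULL) { perror(name); exit(EXIT_FAILURE); }
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            if (fscanf(f, "%d", &x) == 1) n++;
    fclose(f);
    printf("read %ld numbers\n", n);    /* expect 10000000 */
}

int main(void)
{
    generate("/tmp/big.dat");   /* timed as separate programs above */
    read_back("/tmp/big.dat");
    return 0;
}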
Of course, *binary* I/O in C *would* be quite a bit faster:
(1') generate the same integers but write a row at a time using fwrite():
5 seconds cpu, 25 seconds real; 40 MB.
(2') read the same integers a row at a time using fread() (both steps
are sketched below):
0.26 seconds cpu, 1 second real.
This would appear to more than justify "20-50 times slower", but reading
binary data and reading data in a textual representation are different
things, "less than 4 times slower" is the fairer measure. However, it
does emphasise the usefulness of problem-specific bulk reading techniques.
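Again a minimal sketch rather than the exact code, with an illustrative
file name and a made-up final count; the shape is what matters: one
fwrite() or fread() call per row of 800 ints (3200 bytes each, assuming
4-byte ints), which gives the 40 MB file mentioned above.

/* Sketch of steps (1') and (2'): the same integers as a binary file,
   one fwrite()/fread() call per row.  Names are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 12500
#define COLS   800

int main(void)
{
    int  row[COLS];             /* 800 x 4 bytes = 3200 bytes per row */
    long i, j, n = 0;
    FILE *f;

    /* (1') write: one fwrite() per row; 12500 x 3200 bytes = 40 MB */
    f = fopen("/tmp/big.bin", "wb");
    if (f == NULL) { perror("/tmp/big.bin"); return EXIT_FAILURE; }
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            row[j] = (int)(drand48() * 1e9);
        fwrite(row, sizeof (int), COLS, f);
    }
    fclose(f);

    /* (2') read: one fread() per row */
    f = fopen("/tmp/big.bin", "rb");
    if (f == NULL) { perror("/tmp/big.bin"); return EXIT_FAILURE; }
    while (fread(row, sizeof (int), COLS, f) == COLS)
        n += COLS;
    fclose(f);
    printf("read %ld integers\n", n);   /* expect 10000000 */
    return 0;
}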
I thought I'd give you another R measurement:
> system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
But I got sick of waiting for it, and killed it after 843 cpu seconds,
3075 real seconds. Without knowing how far it had got, one can say no
more than that this is at least 10 times slower than the more informed
call to read.table.
What this tells me is that if you know something about the data that
you _could_ tell read.table about, you do yourself no favour by keeping
read.table in the dark. All those options are there for a reason, and
it *will* pay to use them.