[Rd] Slow IO: was [R] naive question
vograno at evafunds.com
Wed Jun 30 03:46:08 CEST 2004
I believe IO in R is slow because of the way it is implemented, not
because it has to do some extra work for the user.
I compared scan() with the 'what' argument set (which is, AFAIK, the
fastest way to read a CSV file) with equivalent C code. The R version
turned out to be 20 - 50 times slower.
I can see at least two main reasons why R's IO is so slow (I didn't
profile this though):
A) it reads from a connection char-by-char rather than doing buffered
reads. Reading each char requires a call to scanchar(), which then calls
Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc() is in turn
defined somewhere else (not in scan.c), so the call cannot be inlined,
etc.
B) mkChar, which is used very extensively, is too slow. There are ways
to minimize the number of calls to mkChar, but I won't expand on them
here.
I brought this up because it seems that many people believe that the
slowness is inherent and is a tradeoff for something else. I don't think
this is the case.
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Douglas Bates
Sent: Tuesday, June 29, 2004 5:56 PM
To: Igor Rivin
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] naive question
Igor Rivin wrote:
> I was not particularly annoyed, just disappointed, since R seems like
> a much better thing than SAS in general, and doing everything with a
> combination of hand-rolled tools is too much work. However, I do need
> to work with very large data sets, and if it takes 20 minutes to read
> them in, I have to explore other options (one of which might be
> S-PLUS, which claims scalability as a major, er, PLUS over R).
If you are routinely working with very large data sets it would be
worthwhile learning to use a relational database (PostgreSQL, MySQL,
even Access) to store the data and then access it from R with RODBC or
one of the specialized database packages.
R is slow reading ASCII files because it is assembling the meta-data on
the fly and it is continually checking the types of the variables being
read. If you know all this information and build it into your table
definitions, reading the data will be much faster.
A disadvantage of this approach is the need to learn yet another
language and system. I was going to do an example but found I could not
because I left all my SQL books at home (I'm travelling at the moment)
and I couldn't remember the particular commands for loading a table from
an ASCII file.