[R] good and bad ways to import fixed column data (rpy)
Ross Boylan
ross at biostat.ucsf.edu
Sun Aug 16 22:49:53 CEST 2009
Recorded here so others may avoid my mistakes.
I have a bunch of files containing fixed width data. The R Data guide
suggests that one pre-process them with a script if they are large.
They were 50 MB and up, and I needed to process another file that gave
the layout of the lines anyway.
I tried rpy to not only preprocess but create the R data object in one
go. It seemed like a good idea; it wasn't. The core operation was to
build up a string for each line that looked like "data.frame(var1=val1,
var2=val2, [etc])" and then rbind this to the data.frame so far. I did
this with r(mycommand string). Almost all the values were numeric.
This was incredibly slow: the job had still not finished after running
overnight.
So, the lesson is, don't do that!
I switched to preprocessing that created a csv file, and then read.csv
from R. This worked in under a minute. The result had dimension 150913
x 129.
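For anyone wanting to try the same route, here is a minimal sketch of the preprocessing step. It is not the script I actually ran: the layout (column names, offsets, widths) and the sample lines are made up for illustration, and it is written for a current Python 3 rather than the Python 2.5 listed below. In practice the layout would be parsed from the separate layout file.

```python
import csv
import io

# Hypothetical layout: (column name, start offset, width). The real layout
# came from a separate layout file that is not reproduced here.
LAYOUT = [("id", 0, 5), ("age", 5, 3), ("score", 8, 6)]


def fixed_to_csv(lines, out, layout=LAYOUT):
    """Slice each fixed-width line according to layout and write CSV rows."""
    writer = csv.writer(out, lineterminator="\n")
    # Header row so that read.csv picks up the variable names.
    writer.writerow([name for name, _, _ in layout])
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        writer.writerow([line[start:start + width].strip()
                         for _, start, width in layout])


# Made-up sample data; with real files, pass an open file object as `lines`
# and another opened for writing as `out`.
sample = ["00001 42  3.50\n", "00002  7 12.25\n"]
buf = io.StringIO()
fixed_to_csv(sample, buf)
csv_text = buf.getvalue()
```

The resulting file can then be loaded in a single call with read.csv, which avoids growing the data frame one row at a time.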
The good news in rpy was that I found objects persisted across calls to
the r object.
Exactly why this was so slow I don't know. The two obvious suspects are
the speed of rbind, which I think is pretty inefficient, and the overhead
of crossing the python/R boundary.
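A rough back-of-the-envelope check on the rbind suspicion, under the assumption (not something I measured) that each rbind copies every existing row plus the new one. Under that model the total work grows with the square of the row count:

```python
def rows_copied(n_rows):
    # Assumed cost model: the k-th rbind writes k rows (k-1 old ones plus
    # the new one), so the total is 1 + 2 + ... + n = n(n+1)/2.
    return n_rows * (n_rows + 1) // 2


n = 150913                # number of rows in the final data frame
total = rows_copied(n)    # row copies under the assumed model
ratio = total / n         # how many times more than writing each row once
```

For the 150913 rows here that is on the order of 10^10 row copies, each 129 columns wide, before counting any cost of parsing a command string or crossing the python/R boundary on every line.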
This was on Debian Lenny:
python-rpy 1.0.3-2
Python 2.5.2
R 2.7.1
rpy2 is not available in Lenny, though it is in development versions of
Debian.
Ross Boylan