[R] Hand-crafting an .RData file
Adam D. I. Kramer
adik at ilovebacon.org
Tue Nov 10 07:31:39 CET 2009
Thanks as always for a very helpful response. I'm now loading a few million
rows in only a few seconds.
Cordially,
Adam Kramer
On Mon, 9 Nov 2009, Prof Brian Ripley wrote:
> The R 'save' format (as used for the saved workspace .RData) is described in
> the 'R Internals' manual (section 1.8). It is intended for R objects, and
> you would first have to create one[*] of those in your other application.
> That seems a lot of work.
>
> The normal way to transfer numeric data between applications is to write a
> binary file: R can read such files with readBin(), and it also has
> wrappers/C-code to read a number of common binary data formats (e.g. those
> from SPSS).
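
A minimal sketch of the readBin() route described above, not taken from the original thread. The file layout (a 4-byte little-endian count followed by 8-byte doubles), the temporary file, and the variable names are all assumptions made for illustration. The writer half stands in for whatever external program produces the data.

```r
# Hypothetical data and file location (assumptions for this sketch).
x <- c(1.5, 2.25, -3, 1e6)
path <- tempfile(fileext = ".bin")

# Writer side: stands in for the exporting application.
# Assumed layout: one 4-byte integer count, then the values as 8-byte doubles.
con <- file(path, "wb")
writeBin(length(x), con, size = 4L, endian = "little")
writeBin(x, con, size = 8L, endian = "little")
close(con)

# Reader side: read the count first, then exactly that many doubles.
con <- file(path, "rb")
n <- readBin(con, what = "integer", n = 1L, size = 4L, endian = "little")
y <- readBin(con, what = "double", n = n, size = 8L, endian = "little")
close(con)

unlink(path)
```

Stating size and endian explicitly on both sides keeps the file readable across platforms. An external writer would need to emit the same byte layout.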
>
> With character data there are more issues (and more formats, see also
> readChar()), but load() is not particularly fast for those.
>
> Ultimately the R functions pay a performance price for their flexibility so
> hand-crafted C code to read the format can be worthwhile: but see the
> comments below about whether I/O speed is that important.
>
> [*] the 'save' format is a serialization of a single R object, even if you
> save many objects, since the object(s) are combined into a pairlist.
>
> On Sun, 8 Nov 2009, Adam D. I. Kramer wrote:
>
>> Hello,
>>
>> I frequently have to export a large quantity of data from some
>> source (for example, a database, or a hand-written perl script) and then
>> read it into R. This occasionally takes a lot of time; I'm usually using
>> read.table("filename",comment.char="",quote="") to read the data once it is
>> written to disk.
>
> Specifying colClasses and nrows will usually help.
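
A minimal sketch of that suggestion, assuming a hypothetical tab-delimited file with one integer, one numeric, and one character column. The file, column names, and row count are invented for illustration. Declaring colClasses spares read.table() from guessing each column's type, and nrows lets it size its buffers up front.

```r
# Build a small hypothetical tab-delimited file to read back (assumption).
path <- tempfile(fileext = ".tsv")
df <- data.frame(id = 1:5,
                 value = c(0.5, 1.5, 2.5, 3.5, 4.5),
                 label = letters[1:5],
                 stringsAsFactors = FALSE)
write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it with column types and the row count declared in advance.
dat <- read.table(path, header = TRUE, sep = "\t",
                  comment.char = "", quote = "",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 5)

unlink(path)
```

For real data nrows need not be exact. A mild over-estimate still helps memory use.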
>
> To read from a database, packages such as RODBC use binary data transfer:
> with suitable tuning this can be fast.
>
>> However, I *know* that the program that generates the data is more
>> or less just calling printf in a for loop to create the csv or tab-delimited
>> file, writing, then having R parse it, which is pretty inefficient.
>> Instead, I am interested in figuring out how to write the data in .RData
>> format so that I can load() it instead of read.table() it.
>
> Without more details it is hard to say if it is inefficient. read.table() can
> read data pretty fast (millions of items per second) if used following the
> hints in the 'R Data' manual. See e.g.
> https://stat.ethz.ch/pipermail/r-devel/2004-December/031733.html
>
> Almost anything non-trivial one might do with such data is much slower. The
> trend is to write richer (and slower to read) data formats.
>
>> Trolling the internet, however, has not suggested anything about the
>> specification for an .RData file. Could somebody link me to a specification
>> or some information that would instruct me on how to construct a .RData
>> file (either compressed or uncompressed)?
>>
>> Also, I am open to other suggestions of how to get load()-like
>> efficiency in some other way.
>>
>> Many thanks,
>> Adam D. I. Kramer
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>