[Rd] RData File Specification?

Fri Aug 24 20:06:52 CEST 2007

I was going to write 'Use the source, Luke', but it seems that you have 
already found the relevant source files. I wrote a Python-based Rdata 
writer and a reader some time ago using just that info, and I am not
aware of any file spec, so I know those two files are sufficient. For 
what you want to do, I think you'll have to write some fairly 
substantial code to process the Rdata as a plain XDR stream (as my Python 
scripts do, using Python's built-in xdrlib), because as far as I know the
API you are after is not exposed. You can, of course, cut and paste a
substantial part of saveload.c and serialize.c for that purpose.
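
To give an idea of what treating the file as a plain XDR stream looks like,
here is a minimal sketch that reads only the file header. It is based on my
reading of serialize.c and on the byte dump in the quoted message, not on any
official spec. The names read_rdata_header, unpack_r_version and SAMPLE are
made up for illustration, and the struct module stands in for xdrlib, which
recent Python releases no longer ship.

```python
import io
import struct


def unpack_r_version(packed):
    """Split a packed R version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def read_rdata_header(stream):
    """Read the magic line, the format line and the three version integers.

    Assumes an uncompressed binary (XDR) file as written by
    save(..., compress=FALSE).
    """
    magic = stream.read(5)
    if magic != b"RDX2\n":
        raise ValueError("unexpected magic bytes: %r" % magic)
    fmt = stream.read(2)
    if fmt != b"X\n":
        raise ValueError("not an XDR serialization: %r" % fmt)
    # XDR integers are 4 bytes, big-endian; three of them follow the format line.
    version, writer, min_reader = struct.unpack(">3i", stream.read(12))
    return {
        "version": version,
        "writer": unpack_r_version(writer),
        "min_reader": unpack_r_version(min_reader),
    }


# The first 19 bytes of the dump posted in the quoted message.
SAMPLE = b"RDX2\nX\n\x00\x00\x00\x02\x00\x02\x04\x01\x00\x02\x03\x00"

header = read_rdata_header(io.BytesIO(SAMPLE))
print(header)
```

Everything after these bytes is the serialized object itself, which is where
the substantial part of the work lies.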

I think my Python-based Rdata reader would do most of what you want.
It was written mostly for diagnostic purposes: I was 'hand-crafting'
R objects in C, saving them as Rdata, and then reading them back to tell 
me what, if anything, was wrong with them. The one difference is that it 
dumps a sort of general human-readable ASCII text format rather than CSV...

My suggestion would be to use a language you are comfortable with that 
comes with an XDR library, and just do it by hand...
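
As a rough illustration of doing it by hand, the sketch below decodes the
flags word that precedes each serialized item and then walks a numeric
vector one value at a time, without holding the whole vector in memory.
The bit layout (type in the low 8 bits, then object, attribute and tag
bits) is my reading of serialize.c, so treat it as an assumption. The names
unpack_flags, iter_reals and SAMPLE are hypothetical, and the struct module
is used in place of an XDR library to keep the example self-contained.

```python
import io
import struct

REALSXP = 14  # SEXPTYPE code for a vector of doubles


def unpack_flags(word):
    """Decode the flags integer written before each serialized item."""
    return {
        "type": word & 0xFF,
        "is_object": bool(word & (1 << 8)),
        "has_attr": bool(word & (1 << 9)),
        "has_tag": bool(word & (1 << 10)),
    }


def iter_reals(stream):
    """Yield the doubles of one REALSXP item, one value at a time."""
    (word,) = struct.unpack(">i", stream.read(4))
    flags = unpack_flags(word)
    if flags["type"] != REALSXP:
        raise ValueError("expected REALSXP, got type %d" % flags["type"])
    (length,) = struct.unpack(">i", stream.read(4))
    for _ in range(length):
        # XDR doubles are 8 bytes, big-endian IEEE 754.
        (value,) = struct.unpack(">d", stream.read(8))
        yield value


# Same bytes as the first column in the quoted dump: type 14, length 3, 1 2 3.
SAMPLE = struct.pack(">ii3d", REALSXP, 3, 1.0, 2.0, 3.0)

for value in iter_reals(io.BytesIO(SAMPLE)):
    print(value)

# The data frame itself starts with flags 0x313 in the quoted dump.
print(unpack_flags(0x313))
```

A real converter also has to handle the other item types (strings, integer
vectors, pairlists of attributes and so on), which this sketch leaves out.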

Cook, Ian wrote:
> Hi,
> 
> I am developing a tool for converting a large data frame stored in an uncompressed binary (XDR) RData file to a delimited text file.  The data frame is too large to load() and extract rows from on a typical PC.  I'm looking to parse through the file and extract individual entries without loading the whole thing into memory.
> 
> In terms of some C source functions, instead of doing RestoreToEnv(R_Unserialize(connection)) which is essentially what load() does, I'm looking to get the documentation I would need to build a function "SaveToCSV()" so that I could do SaveToCSV(R_Unserialize(connection)).
> 
> Where can I get documentation on the RData file format?  Does a spec document exist?
> 
> See details below.
> 
> Thanks,
> Ian
> 
> Ian Cook | Advanced Micro Devices, Inc. | ian.cook at amd.com
> 
> -------------------------
> 
> Additional details:
> 
> I've browsed through the relevant source code (saveload.c, serialize.c) for ideas.
> 
> Here's a demo of the problem I'm looking to solve:
> 
> # create a sample data frame
> ds <- data.frame(row1=c(1,2,3),row2=c('a','b','c'))
> # save into an uncompressed binary R dataset
> save(ds,file="ds.rdata",compress=FALSE)
> rm(ds)
> 
> # Then load() can be simulated like this:
> 
> # create and open a file connection
> con <- file("ds.rdata",open="rb")
> # read the first 5 characters
> readChar(con,5)
> # unserialize the remainder and restore to the environment
> ds <- unserialize(con,NULL)[["ds"]]
> close(con)
> 
> But this takes up too much memory if the data set is too big.  I can read in the file character-by-character, i.e. using readChar(), but it's obvious that the file format is not trivial.  readChar(con,10000) for this demo yields:
> 
> RDX2\nX\n\0\0\0\002\0\002\004\001\0\002\003\0\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\002ds\0\0\003\023\0\0\0\002\0\0\0\016\0\0\0\003?ð\0\0\0\0\0\0@\0\0\0\0\0\0\0@\b\0\0\0\0\0\0\0\0\003\r\0\0\0\003\0\0\0\001\0\0\0\002\0\0\0\003\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\006levels\0\0\0\020\0\0\0\003\0\0\0\t\0\0\0\001a\0\0\0\t\0\0\0\001b\0\0\0\t\0\0\0\001c\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005class\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\006factor\0\0\0þ\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005names\0\0\0\020\0\0\0\002\0\0\0\t\0\0\0\004row1\0\0\0\t\0\0\0\004row2\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\trow.names\0\0\0\r\0\0\0\002€\0\0\0\0\0\0\003\0\0\004\002\0\0\003ÿ\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\ndata.frame\0\0\0þ\0\0\0þ
> 
> This would be parse-able if I had a file spec.  Thanks.