Representation of data in libraries

Thomas Lumley thomas@biostat.washington.edu
Tue, 24 Feb 1998 13:16:22 -0800 (PST)


On 24 Feb 1998, Douglas Bates wrote:

> At present the example data sets in R libraries are to be given as
> expressions that can be read directly into R.  For example, the acid.R 
> file in the main library looks like
>  acid <- data.frame(
>   carb  = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
>   optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782), row.names = paste(1:6))
> 
> This is great when you have only a few observations.  I have one
> example data set with over 9000 rows and 17 variables.  Even when I
> set -v 40, I exhaust the available memory trying to read it in as a
> data.frame. 

You need to specify -n some_large_number to read in large data sets;
specifying -v alone is not enough.  You can see this by using gcinfo(T) to
report heap and cons cell usage at each garbage collection.
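
A minimal sketch of the gcinfo() check described above, assuming a
reasonably recent R (where T is better spelled TRUE). The data frame is
a made-up stand-in for a large data set, not the one under discussion.

```r
# Report memory usage at every garbage collection from now on.
# gcinfo() returns the previous setting, which is kept for restoring later.
old_setting <- gcinfo(TRUE)

# Hypothetical stand-in for a large data set: 9000 rows, 2 variables.
big <- data.frame(a = rnorm(9000), b = rnorm(9000))

# gc() forces a collection and returns a matrix whose rows are
# Ncells (cons cells) and Vcells (vector heap).
usage <- gc()
print(usage)

# Put the reporting flag back the way it was.
gcinfo(old_setting)
```

If the Ncells row is the one near its limit, the cons cells rather than
the vector heap are the constraint.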

> Are there alternatives that would cause less memory usage?  In
> S/S-PLUS the data.dump/data.restore functions use a portable
> representation that can be parsed without exponential memory growth.

The R save() format is portable, at least among Unices. You could have the
data.R file contain the command
	eval(load("data.Rdata"),.GlobalEnv)
where "data.Rdata" is the saved file. save() also has an ascii=T option,
which might make the file more portable to other operating systems.

I haven't checked, but I assume that this format can be read more
efficiently than sourcing R code.
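
A rough sketch of the save()/load() round trip, assuming a current R in
which load() accepts an envir argument. The file path and the `restored`
environment are illustrative only, and the small acid data set quoted
above stands in for a large one.

```r
# Small stand-in data set (the acid example quoted earlier).
acid <- data.frame(
  carb   = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
  optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
  row.names = paste(1:6))

# Hypothetical location; a real library would ship this file itself.
rdata_file <- file.path(tempdir(), "acid.Rdata")

# ascii = TRUE writes a text representation instead of a binary one.
save(acid, file = rdata_file, ascii = TRUE)

# Load into a separate environment so the round trip can be checked;
# envir = .GlobalEnv would instead mimic a one-line data.R file.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

# load() returns the names of the objects it restored.
print(loaded_names)

# Compare the restored copy with the original.
print(isTRUE(all.equal(restored$acid, acid)))
```

Whether this is faster or lighter on memory than sourcing the equivalent
R code is not measured here.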


	-thomas

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._