Representation of data in libraries

Martin Maechler <maechler@stat.math.ethz.ch>
Wed, 25 Feb 1998 08:34:03 +0100


>>>>> "DougB" == Douglas Bates <bates@stat.wisc.edu> writes:

    DougB> At present the example data sets in R libraries are to be given as
    DougB> expressions that can be read directly into R.  For example, the acid.R 
    DougB> file in the main library looks like
    DougB> acid <- data.frame(
    DougB>     carb   = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
    DougB>     optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
    DougB>     row.names = paste(1:6))

    DougB> This is great when you have only a few observations.  I have one
    DougB> example data set with over 9000 rows and 17 variables.  Even when I
    DougB> set -v 40, I exhaust the available memory trying to read it in as a
    DougB> data.frame.  I believe this is because of the recursive nature of the
    DougB> parsing of data objects.

yes; 

    DougB> Are there alternatives that would cause less memory usage?

yes; but only in the 0.62 development version.
The current 0.62 ``standard'' is:

if a 'data' file ends in
	.R	source(.) is used to read it;
if it ends in
	.tab	read.table(..., header = TRUE) is used to read it.
(You will find the new data(.) function in src/library/base/data in the R snapshot.)
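
In other words, data(.) dispatches purely on the file extension.  A
minimal sketch of that logic -- NOT the actual data(.) source; the
function name loadDataFile() and the use of grepl()/assign() are only
for illustration:

	loadDataFile <- function(file) {
	    if (grepl("\\.R$", file)) {
		## .R files contain plain R code: just source() them
		source(file)
	    } else if (grepl("\\.tab$", file)) {
		## .tab files are tables with a header line: read them
		## and assign under the file's base name
		name <- sub("\\.tab$", "", basename(file))
		assign(name, read.table(file, header = TRUE),
		       envir = .GlobalEnv)
	    } else {
		stop("unknown data file extension: ", file)
	    }
	}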

Note that this is still not really satisfactory for large data files,
since read.table(.) is not very efficient:
	it first reads everything into a character matrix and then converts
	variable by variable, some to numeric, some to factor.

On the other hand: does it really make sense to distribute huge example
data sets like yours above?
If yes, AND if you have only numeric data,
I'd propose the following:
 1) create a  <pkg>/data/dougBex.R
    file which only contains something like
	dougBex <- as.data.frame(
		    matrix(scan(system.file("<pkg>/data/dougBex.dat")),
			   ncol = ...,  
			   dimnames = ...))
 2) create   <pkg>/data/dougBex.dat  to contain all your data as
    white-space delimited numeric values.


    DougB> In S/S-PLUS the data.dump/data.restore functions use a portable
    DougB> representation that can be parsed without exponential memory growth.

hmm, yes, we have been longing for someone to write  data.dump/data.restore
for R.
	Any volunteers?

--
Martin Maechler <maechler@stat.math.ethz.ch>			<><
Seminar fuer Statistik, ETH-Zentrum SOL G1;	Sonneggstr.33
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1086
http://www.stat.math.ethz.ch/~maechler/