[R] memory once again

Thomas Lumley tlumley at u.washington.edu
Mon Mar 6 18:01:31 CET 2006

On Fri, 3 Mar 2006, Dimitri Joe wrote:

> Dear all,
> A few weeks ago, I asked this list why small Stata files became huge R
> files. Thomas Lumley said it was because "Stata uses single-precision
> floating point by default and can use 1-byte and 2-byte integers. R uses
> double precision floating point and four-byte integers." And it seemed I
> couldn't do anything about it.
> Is it true? I mean, isn't there a (more or less simple) way to change
> how R stores data (maybe by changing the source code and compiling it)?

It's not impossible, but it really isn't as easy as you might think.

It would be relatively easy to change the definition of REALSXPs and 
INTSXPs so that they stored 4-byte and 2-byte data respectively.  It would 
be a lot harder to go through all the C and Fortran numerical, 
input/output, and other processing code to either translate from short to 
long data types or to make the code work for short data types.  For 
example, the math functions would want to do computations in double (as 
Stata does) but the input/output functions would presumably want to use 
the short types directly.

Adding two more SEXP types to give e.g. "single" and "shortint" might be 
easier (if there are enough bits left in the SEXPTYPE header), but would 
still require adding code to nearly every C function in R.

Single-precision floating point has been discussed for R in the past, and 
the extra effort and resulting larger code were always considered too high 
a price.  Since the size of data set that R can handle doubles every 18 
months or so without any effort on our part, it is hard to motivate 
diverting effort away from problems that will not solve themselves.  This 
doesn't help you, of course, but it may help explain why we can't.

Another thing that might be worth pointing out: Stata also keeps all its 
data in memory and so can handle only "small" data sets.  One reason that 
Stata is so fast and that Stata's small data sets can be larger than R's 
is the more restrictive language. This is more important than the 
compression from smaller data types -- you can use a dataset in Stata that 
is nearly as large as available memory (or address space), which is a 
factor of 3-10 better than R manages. On the other hand, for operations 
that do not fit well with the Stata language structure, it is quite slow. 
For example, the new Stata graphics in version 8 required some fairly 
significant extensions to the language and are still notably slower than 
the lattice graphics in R (a reasonably fair comparison since both are 
interpreted code).

The terabyte-scale physics and astronomy data that other posters alluded 
to require a much more restrictive form of programming than R to get 
reasonable performance.  R does not make you worry about how your data are 
stored and which data access patterns are fast or slow, but if your data 
are larger than memory you have to worry about these things. The 
difference between one-pass and multi-pass algorithms, between O(n) and 
O(n^2) time, even between sequential-access and random-access algorithms 
all matter, and the language can't hide them. Fortunately, most 
statistical problems are small enough to solve by throwing computing power 
at them, perhaps after an initial subsampling or aggregating phase.

The initial question was about read.dta. Now, read.dta() could almost 
certainly be improved a lot, especially for wide data sets. It uses very 
inefficient data frame operations to handle factors, for example.  It used 
to be a lot faster than read.table, but that was before Brian Ripley 
improved read.table.
