[R] memory once again
Thomas Lumley
tlumley at u.washington.edu
Mon Mar 6 18:01:31 CET 2006
On Fri, 3 Mar 2006, Dimitri Joe wrote:
> Dear all,
>
> A few weeks ago, I asked this list why small Stata files became huge R
> files. Thomas Lumley said it was because "Stata uses single-precision
> floating point by default and can use 1-byte and 2-byte integers. R uses
> double precision floating point and four-byte integers." And it seemed I
> couldn't do anything about it.
>
> Is it true? I mean, isn't there a (more or less simple) way to change
> how R stores data (maybe by changing the source code and compiling it)?
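The sizes in question are easy to check from within R itself; a quick
illustration (the exact byte counts include a small per-object header and
vary a little across R versions):

    x <- numeric(1e6)    # double precision: 8 bytes per element, about 8 MB
    y <- integer(1e6)    # 4-byte integers: about 4 MB
    object.size(x)
    object.size(y)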
It's not impossible, but it really isn't as easy as you might think.
It would be relatively easy to change the definition of REALSXPs and
INTSXPs so that they stored 4-byte and 2-byte data respectively. It would
be a lot harder to go through all the C and Fortran numerical,
input/output, and other processing code to either translate from short to
long data types or to make the code work for short data types. For
example, the math functions would want to do computations in double (as
Stata does) but the input/output functions would presumably want to use
float.
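R already works this way at the boundary between integer and double
storage: arithmetic on 4-byte integers is promoted to double as soon as a
non-integer result is possible. A quick illustration:

    typeof(1L + 1L)     # "integer" -- integer arithmetic stays integer
    typeof(1L / 2L)     # "double"  -- division is carried out in double
    typeof(sqrt(4L))    # "double"  -- the math functions return doubles

So short storage types would mainly change how the data sit in memory, not
how the computations are done.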
Adding two more SEXP types to give eg "single" and "shortint" might be
easier (if there are enough bits left in the SEXPTYPE header), but would
still require adding code to nearly every C function in R.
Single-precision floating point has been discussed for R in the past, and
the extra effort and resulting larger code were always considered too high
a price. Since the size of data set that R can handle doubles every 18 months
or so without any effort on our part, it is hard to motivate diverting
effort away from problems that will not solve themselves. This doesn't
help you, of course, but it may help explain why we can't.
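A related detail: base R does have as.single(), but it only tags a vector
for single-precision transfer to .C() and .Fortran(); the data themselves
are still stored in double precision, so it saves no memory. A minimal
illustration:

    x <- runif(1e6)
    s <- as.single(x)
    typeof(s)                         # still "double"
    object.size(s) - object.size(x)   # only a small "Csingle" attribute added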
Another thing that might be worth pointing out: Stata also keeps all its
data in memory and so can handle only "small" data sets. One reason that
Stata is so fast and that Stata's small data sets can be larger than R's
is the more restrictive language. This is more important than the
compression from smaller data types -- you can use a dataset in Stata that
is nearly as large as available memory (or address space), which is a
factor of 3-10 better than R manages. On the other hand, for operations
that do not fit well with the Stata language structure, it is quite slow.
For example, the new Stata graphics in version 8 required some fairly
significant extensions to the language and are still notably slower than
the lattice graphics in R (a reasonably fair comparison since both are
interpreted code).
The terabyte-scale physics and astronomy data that other posters alluded
to require a much more restrictive form of programming than R to get
reasonable performance. R does not make you worry about how your data are
stored and which data access patterns are fast or slow, but if your data
are larger than memory you have to worry about these things. The
differences between one-pass and multi-pass algorithms, between O(n) and
O(n^2) time, and even between sequential-access and random-access algorithms
all matter, and the language can't hide them. Fortunately, most
statistical problems are small enough to solve by throwing computing power
at them, perhaps after an initial subsampling or aggregation phase.
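To make the one-pass idea concrete, here is a minimal sketch (the file
"big.txt", assumed to hold a single numeric column, is hypothetical): the
mean is accumulated chunk by chunk, so memory use stays bounded no matter
how large the file is.

    con <- file("big.txt", open = "r")
    total <- 0; n <- 0
    repeat {
        x <- scan(con, what = numeric(), n = 1e5, quiet = TRUE)
        if (length(x) == 0) break     # end of file
        total <- total + sum(x)
        n <- n + length(x)
    }
    close(con)
    total / n                         # the single-pass mean

Nothing here is R-specific; it is the kind of access-pattern bookkeeping
that no language can hide once the data outgrow memory.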
The initial question was about read.dta. Now, read.dta() could almost
certainly be improved a lot, especially for wide data sets. It uses very
inefficient data frame operations to handle factors, for example. It used
to be a lot faster than read.table, but that was before Brian Ripley
improved read.table.
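For what it is worth, some of the space can be recovered after the fact. A
hedged sketch (read.dta() is in the foreign package; the file name
"survey.dta" is hypothetical): any numeric columns that come back as doubles
but really hold whole numbers can be stored as 4-byte integers instead.

    library(foreign)
    d <- read.dta("survey.dta")
    ## whole-number columns within integer range -> 4-byte integer storage
    whole <- sapply(d, function(x)
        is.numeric(x) && all(x == round(x), na.rm = TRUE) &&
        all(abs(x) < .Machine$integer.max, na.rm = TRUE))
    d[whole] <- lapply(d[whole], as.integer)

That halves the storage for those columns, which is roughly as close as
current R gets to Stata's 1- and 2-byte types.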
-thomas