[R] vsize and nsize

Prof Brian D Ripley ripley at stats.ox.ac.uk
Tue May 18 19:38:09 CEST 1999

On Tue, 18 May 1999, Thomas Lumley wrote:

> On Tue, 18 May 1999, Jim Lindsey wrote:
> > I am wondering what you mean by "R's poor handling of large datasets".
> > How large is large? I have often been working simultaneously with a
> > fair number of vectors of say 40,000 using my libraries (data objects
> > and functions) with no problems. They use the R scoping rules. On the
> > other hand, if you use dataframes and/or standard functions like glm,
> > then you are restricted to extremely small (toy) data sets. But then
> > maybe you are thinking of gigabytes of data.
> While I agree that R isn't much use for really big datasets, I am a bit
> surprised by Jim's comments about glm(). I have used R (including glm)
> perfectly satisfactorily on real data sets of ten thousand or so
> observations. This isn't large by today's standards but it isn't in
> any sense toy data.

Me too, and I have done this in S as well (often with a lot less memory
usage than R). I don't believe scoping rules have anything to do with this
(and glm uses R's scoping rules as well: it is hard to use anything else in
R!), but how code is written does. Bill Venables and I had various
discussions at the last S-PLUS users' conference about whether one could
ever justify using regressions, say, with more than 10,000 cases. Points
against include

- such datasets are unlikely to be homogeneous and are better analysed in
  natural strata.
- statistical significance is likely to occur at practically insignificant
  effect sizes.

and the `for' points include

- it may be a legal requirement to use all the data,
- datasets can be very unbalanced, as in 70,000 normals and 68 diseased
  (but then one can sub-sample the normals).

However, that is for statistical analysis, not dataset manipulations.

I think I started this by quoting Ross as saying that R is not designed for
large datasets (and neither was S version 3). Large was in the context of a
100Mb heap and 80Mb ncells space, which I think answers Jim's question (go
up a couple of orders of magnitude).  Remember that the S developers came
from the Unix `tools' background and said they expected tools such as awk
to be used to manipulate datasets. These days we (and probably they) prefer
more help.
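For context, the heap (vsize) and cons-cell (nsize) limits mentioned above were fixed at startup in R of this era, so "large" meant whatever fit under those limits. A sketch of how one might have raised them, assuming the startup options and environment variables of R circa 0.64; the exact syntax and defaults varied by version:

```shell
# Sketch, not a definitive recipe: raising R's memory limits at startup.
# --vsize sets the vector heap size, --nsize the number of cons cells;
# both had to be chosen before the session began.
R --vsize=100M --nsize=2000000

# The same limits could also be supplied via environment variables
# (names assumed here) before launching R:
R_VSIZE=100M R_NSIZE=2000000 R
```

If a session ran out of space mid-analysis, the only remedy was to save the workspace, restart with larger limits, and reload.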

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
