[R] vsize and nsize

Prof Brian D Ripley ripley at stats.ox.ac.uk
Tue May 18 19:38:09 CEST 1999


On Tue, 18 May 1999, Thomas Lumley wrote:

> On Tue, 18 May 1999, Jim Lindsey wrote:
> 
> > I am wondering what you mean by "R's poor handling of large datasets".
> > How large is large? I have often been working simultaneously with a
> > fair number of vectors of say 40,000 using my libraries (data objects
> > and functions) with no problems. They use the R scoping rules. On the
> > other hand, if you use dataframes and/or standard functions like glm,
> > then you are restricted to extremely small (toy) data sets. But then
> > maybe you are thinking of gigabytes of data.
> 
> While I agree that R isn't much use for really big datasets, I am a bit
> surprised by Jim's comments about glm(). I have used R (including glm)
> perfectly satisfactorily on real data sets of ten thousand or so
> observations. This isn't large by today's standards but it isn't in
> any sense toy data.

Me too, and I have done this in S as well (often with a lot less memory
usage than R). I don't believe scoping rules have anything to do with this
(and glm uses R's scoping rules as well: it is hard to use anything else in
R!), but how code is written does. Bill Venables and I had various
discussions at the last S-PLUS users' conference about whether one could
ever justify using regressions, say, with more than 10,000 cases. Points
against include

- such datasets are unlikely to be homogeneous and are better analysed in
  natural strata.
- statistical significance is likely to occur at practically insignificant
  effect sizes.

and the `for' points include

- it may be a legal requirement to use all the data,
- datasets can be very unbalanced, as in 70,000 normals and 68 diseased
  (but then one can sub-sample the normals).
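
A minimal sketch of that sub-sampling step, using an invented data frame
`dat' with a 0/1 column `diseased' (both names are assumptions, purely for
illustration):

  set.seed(1)
  cases   <- which(dat$diseased == 1)      # the 68
  normals <- which(dat$diseased == 0)      # the 70,000
  keep    <- c(cases, sample(normals, 10 * length(cases)))
  sub     <- dat[keep, ]                   # about 68 + 680 rows to analyse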

However, that is for statistical analysis, not dataset manipulations.
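
For concreteness, the sort of fit Thomas describes is easy to try; a sketch
on simulated (not real) data of about that size:

  set.seed(2)
  n <- 10000
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(-1 + 0.5 * x))   # logistic model, made-up effect
  fit <- glm(y ~ x, family = binomial)
  summary(fit)$coefficients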

I think I started this by quoting Ross as saying that R is not designed for
large datasets (and neither was S version 3). Large was in the context of a
100Mb heap and 80Mb ncells space, which I think answers Jim's question (go
up a couple of orders of magnitude).  Remember that the S developers came
from the Unix `tools' background and said they expected tools such as awk
to be used to manipulate datasets. These days we (and probably they) prefer
more help.
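
On the vsize/nsize point in the subject line: the heap and cons-cell space
are fixed when R starts, so the usual remedy is simply to ask for more of
each at startup. A sketch (see ?Memory for the exact option syntax in your
version; the sizes here are only examples):

  ## from the shell: enlarge the vector heap and the number of cons cells
  ##   R --vsize=100M --nsize=2000000
  gc()   # once running, gc() reports Vcells (vsize) and Ncells (nsize) usage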


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
