[R] Performance & capacity characteristics of R?

Terry J. Westley twestley at buffalo.veridian.com
Sat Jul 31 18:18:51 CEST 1999

> On Tuesday, August 03, 1999 4:15 AM, Prof Brian D Ripley wrote:
> Can you tell us what statistical procedures need 1 million to 100s of
> millions of rows (observations)?  Some of us have doubted that there are
> even datasets of 100,000 examples that are homogeneous and for which a
> small subsample would not give all the statistical information. (If they
> are not homogeneous, one could/should analyse homogeneous subsets and do a
> meta-analysis.)
> Your datasets appear to be (taking a mid-range value) around 1Gbyte
> in size.

I won't speak for Karsten, but will describe my own use of R for
(potentially) large datasets, just to give you an idea of what at
least one large dataset user is attempting to do...

The application is a simulation of radar detection and tracking of
aircraft.  The data collected is radar detections and tracks.  There
are potentially (though not typically) 200 radars by 200 aircraft in
the simulation.  In this extreme case, I expect to collect approx.
2GB of data in a 4-hour simulation run.  Fortunately, this is not
typical; I'm trying to get a better handle on what's typical.

There are two primary uses.  One is to produce various plots, such as
   a) detections and tracks against time, and 
   b) detections and tracks against geographic location with
      respect to true aircraft position.

The other is to perform statistical measures across multiple runs.
I don't know all the details of what functions will be performed,
but a paired t-test has been mentioned.  A typical question to be
answered is: how well did a jammer perform in reducing the
number of detections and tracks?  Another is: how much
better did this aircraft perform (in avoiding detection)
compared to that other aircraft?  Another: which flight path is
better for avoiding detection and tracking?
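For questions like the jammer one, a paired comparison across matched runs
is straightforward in R.  A minimal sketch with made-up data (the counts
and variable names here are hypothetical, not from my actual runs):

```r
# Per-run detection counts from matched simulation runs,
# without and with the jammer active (hypothetical values).
detections_baseline <- c(412, 388, 455, 430, 401)
detections_jammed   <- c(305, 290, 342, 318, 299)

# Paired t-test: did the jammer significantly reduce detections?
t.test(detections_baseline, detections_jammed, paired = TRUE)
```

The pairing matters because each pair of runs shares the same scenario
(flight paths, radar placement), so run-to-run variation is controlled for.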

The largest datasets contain approx 50M observations of 20 variables
for detections.  For handling these large datasets, I'm counting on
the fact that typical analyses focus on smaller time ranges and on
specific aircraft.  So, I plan to preprocess the data to select just
the radars, aircraft, and time range of interest before loading the
data file into R.
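The workflow I have in mind looks roughly like this (the file and column
names below are hypothetical, just to illustrate the shape of it): an
external query extracts only the radars, aircraft, and time window of
interest, and R then reads the much smaller result.

```r
# Preprocessing (Oracle query or similar) has already written out a
# small extract covering one radar, one aircraft, and a time window.
subset_file <- "detections_radar17_ac042.csv"  # hypothetical extract

d <- read.table(subset_file, header = TRUE, sep = ",")

# Any further trimming on the (now small) data frame is cheap in R,
# e.g. narrowing to the first 600 seconds of the run:
d <- d[d$time >= 0 & d$time <= 600, ]
```

This keeps R's memory footprint down to the slice actually being analyzed
rather than the full 50M-observation dataset.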

Currently, the dataset is kept in Oracle; I hope to transition
to HDF (http://hdf.ncsa.uiuc.edu/).  Oracle has lots of advantages,
but is very slow to load this much data.  I have yet to evaluate
HDF for large datasets.

Terry J. Westley, Principal Engineer
Veridian Engineering, Calspan Operations
P.O. Box 400, Buffalo, NY 14225
twestley at buffalo.veridian.com    http://www.veridian.com

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
