Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Feb 15 18:51:18 CET 2005


On Tue, 15 Feb 2005, Graham Jones wrote:

> In message <200502151112.j1FB5fZ5002722 at hypatia.math.ethz.ch>, r-help-
> request at stat.math.ethz.ch writes

[Actually quoting Bert Gunter, BTW]

>> Can someone give me an example (perhaps in a private response, since I'm off
>> topic here) where one actually needs all cases in a large data set ("large"
>> being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
>> say searching for some particular characteristic like an adverse event in a
>> medical or customer repair database, etc. Maybe a definition of
>> "statistical" is: anything that cannot be routinely done in a single pass
>> database query.
>
> If the dimensionality of the data is large, you may need a large number
> of cases too. An example from my own experience would be using quadratic
> discriminant analysis (with regularization) for classifying symbols for
> an OCR program. With 200 classes and 100 features, I'd really like many
> millions of cases. I've been using about 20,000 per class or 4 million
> in total, but if I had 40 million it would probably work better.
> Compared to many applications in pattern recognition and data mining, I
> think this is a fairly small example.

But Bert's caveats apply: you have 200 problems of size 20,000 since in 
QDA each class's distribution is estimated separately, and a single pass 
will give you the sufficient statistics however large the dataset is.
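
To illustrate the single-pass point, here is a minimal sketch in Python (not from the original thread). It assumes the data arrive as a stream of (label, feature vector) pairs; the names accumulate and mean_and_cov are hypothetical. For each class it keeps only the count, the sum of the feature vectors and the sum of their outer products, so with 100 features the state per class is 1 + 100 + 100*100 numbers however many cases are read. The regularization step mentioned above is not shown.

```python
def accumulate(stream, n_features):
    """One pass over (label, features) pairs.

    Keeps, per class, the count n, the sum of x and the sum of x x^T.
    Names and data layout are assumptions made for this sketch.
    """
    stats = {}
    for label, x in stream:
        if label not in stats:
            stats[label] = {
                "n": 0,
                "s": [0.0] * n_features,
                "ss": [[0.0] * n_features for _ in range(n_features)],
            }
        st = stats[label]
        st["n"] += 1
        for i in range(n_features):
            st["s"][i] += x[i]
            for j in range(n_features):
                st["ss"][i][j] += x[i] * x[j]
    return stats


def mean_and_cov(st):
    """Class mean and unbiased covariance from the accumulated sums.

    Requires at least two cases in the class (divides by n - 1).
    """
    n = st["n"]
    p = len(st["s"])
    mean = [v / n for v in st["s"]]
    cov = [
        [(st["ss"][i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    return mean, cov


# Tiny made-up example: two classes, two features, read once as a stream.
cases = [
    ("a", [1.0, 2.0]),
    ("b", [0.0, 0.0]),
    ("a", [3.0, 4.0]),
    ("b", [2.0, 2.0]),
    ("a", [5.0, 9.0]),
]
stats = accumulate(iter(cases), 2)
for label in sorted(stats):
    print(label, mean_and_cov(stats[label]))
```

The raw sums of products can lose precision through cancellation when the means are large relative to the spread; an updating (Welford-style) recurrence is a common alternative and is still a single pass.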

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



