[R] Suggestion for big files [was: Re: A comment about R:]
François Pinard
pinard at iro.umontreal.ca
Sun Jan 8 20:25:08 CET 2006
[Martin Maechler]
> FrPi> Suppose the file (or tape) holds N records (N is not known
> FrPi> in advance), from which we want a sample of M records at
> FrPi> most. [...] If the algorithm is carefully designed, when
> FrPi> the last (N'th) record of the file will have been processed
> FrPi> this way, we may then have M records randomly selected from
> FrPi> N records, in such a way that each of the N records had an
> FrPi> equal probability to end up in the selection of M records. I
> FrPi> may seek out the details if needed.
>[...] I'm also intrigued about the details of the algorithm you
>outline above.
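For reference, the scheme sketched in the quoted text is what the literature
calls reservoir sampling. Here is a minimal sketch of the basic method (often
called Algorithm R) -- in Python rather than R, purely for illustration; the
function name is mine:

```python
import random

def reservoir_sample(records, m):
    """Uniform random sample of at most m records from a stream
    whose length N is not known in advance (basic Algorithm R)."""
    sample = []
    for t, record in enumerate(records, start=1):
        if t <= m:
            sample.append(record)      # fill the reservoir first
        else:
            # record t is kept with probability m/t,
            # evicting a uniformly chosen resident if so
            j = random.randrange(t)
            if j < m:
                sample[j] = record
    return sample
```

One random number per record suffices, and when the stream ends each of the
N records has probability m/N of being in the sample, which is exactly the
property described above.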
I went into my old SPSS books and related references to find it for you,
to no avail (yet I confess I did not try very hard). I vaguely remember
it was related to Spearman's correlation computation: I did find notes
about the "severe memory limitation" of this computation, but nothing
about the implemented workaround. I did find other sampling devices,
but not the very one I remember having read about, many years ago.
On the other hand, a Google search shows that this topic has been studied
extensively, and that Vitter's algorithm Z seems to be popular nowadays
(even though it is not the simplest) because it is more efficient than the
alternatives.  Google found a copy of the paper:
http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf
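The key idea behind the faster algorithms in that paper is to compute how
many records to *skip* before the next accepted one, instead of drawing a
random number for every record; algorithm Z refines how that skip count is
generated.  Here is a sketch of the simplest skip-based variant (Algorithm X
in the paper), again in Python for illustration and under my reading of the
paper -- the function names are mine:

```python
import random

def algorithm_x(iterable, n):
    """Reservoir sample of size n, drawing one uniform variate per
    *accepted* record plus a short search for each skip count."""
    it = iter(iterable)
    reservoir = [next(it) for _ in range(n)]   # assumes at least n records
    t = n                                      # records seen so far
    while True:
        # Find the smallest S with P(skip > S) <= V, where
        # P(skip > S) = prod_{i=t+1}^{t+S+1} (i - n) / i.
        V = random.random()
        S, quot = 0, (t + 1 - n) / (t + 1)
        while quot > V:
            S += 1
            quot *= (t + 1 + S - n) / (t + 1 + S)
        try:
            for _ in range(S):                 # discard S records unseen
                next(it)
            item = next(it)                    # the accepted record
        except StopIteration:
            return reservoir                   # stream exhausted
        reservoir[random.randrange(n)] = item  # evict a random resident
        t += S + 1
```

With S = 0 the next record is accepted with probability n/(t+1), so this is
equivalent to the one-draw-per-record method, only cheaper once t is large.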
Here is an implementation for Postgres:
http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php
yet I do not find it very readable -- but that is only an opinion: I'm
rather demanding when it comes to legibility, while most people are more
courageous than I am! :-).
--
François Pinard http://pinard.progiciels-bpi.ca