[R] Suggestion for big files [was: Re: A comment about R:]

Fri Jan 6 04:41:21 CET 2006

[Brian Ripley]

>I rather thought that using a DBMS was standard practice in the 
>R community for those using large datasets: it gets discussed rather 
>often.

Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

>Another possibility is to make use of the several DBMS interfaces already 
>available for R.  It is very easy to pull in a sample from one of those, 
>and surely keeping such large data files as ASCII not good practice.

Selecting a sample is easy.  Yet, I'm not aware of any SQL device for 
easily selecting a _random_ sample of the records of a given table.  On 
the other hand, I'm no SQL specialist, others might know better.

We do not have a need yet for samples where I work, but if we ever need 
such, they will have to be random, or else, I will always fear biases.

>One problem with Francois Pinard's suggestion (the credit has got lost) 
>is that R's I/O is not line-oriented but stream-oriented.  So selecting 
>lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random 
selection of lines.  Once again, this chat comes out of reading someone 
else's problem, this is not a problem I actually have.  SPSS was not 
randomly accessing lines, as data files could well be hold on magnetic 
tapes, where random access is not possible on average practice.  SPSS 
reads (or was reading) lines sequentially from beginning to end, and the 
_random_ sample is built while the reading goes.

Suppose the file (or tape) holds N records (N is not known in advance), 
from which we want a sample of M records at most.  If N <= M, then we 
use the whole file, no sampling is possible nor necessary.  Otherwise, 
we first initialise M records with the first M records of the file.  
Then, for each record in the file after the M'th, the algorithm has to 
decide if the record just read will be discarded or if it will replace 
one of the M records already saved, and in the latter case, which of 
those records will be replaced.  If the algorithm is carefully designed, 
when the last (N'th) record of the file will have been processed this 
way, we may then have M records randomly selected from N records, in 
such a a way that each of the N records had an equal probability to end 
up in the selection of M records.  I may seek out for details if needed.

This is my suggestion, or in fact, more a thought that a suggestion.  It 
might represent something useful either for flat ASCII files or even for 
a stream of records coming out of a database, if those effectively do 
not offer ready random sampling devices.

P.S. - In the (rather unlikely, I admit) case the gang I'm part of would 
have the need described above, and if I then dared implementing it 
myself, would it be welcome?

-- 
François Pinard   http://pinard.progiciels-bpi.ca