[R] Suggestion for big files [was: Re: A comment about R:]
François Pinard
pinard at iro.umontreal.ca
Fri Jan 6 04:41:21 CET 2006
[Brian Ripley]
>I rather thought that using a DBMS was standard practice in the
>R community for those using large datasets: it gets discussed rather
>often.
Indeed. (I tried RMySQL even before speaking of R to my co-workers.)
>Another possibility is to make use of the several DBMS interfaces already
>available for R. It is very easy to pull in a sample from one of those,
>and surely keeping such large data files as ASCII is not good practice.
Selecting a sample is easy. Yet, I'm not aware of any SQL device for
easily selecting a _random_ sample of the records of a given table. On
the other hand, I'm no SQL specialist, others might know better.
We do not have a need yet for samples where I work, but if we ever need
such, they will have to be random, or else, I will always fear biases.
>One problem with Francois Pinard's suggestion (the credit has got lost)
>is that R's I/O is not line-oriented but stream-oriented. So selecting
>lines is not particularly easy in R.
I understand that you mean random access to lines, instead of random
selection of lines. Once again, this chat comes out of reading someone
else's problem, this is not a problem I actually have. SPSS was not
randomly accessing lines, as data files could well be held on magnetic
tapes, where random access is not possible in ordinary practice. SPSS
reads (or was reading) lines sequentially from beginning to end, and the
_random_ sample is built as the reading goes.
Suppose the file (or tape) holds N records (N is not known in advance),
from which we want a sample of M records at most. If N <= M, then we
use the whole file: sampling is neither possible nor necessary. Otherwise,
we first initialise M records with the first M records of the file.
Then, for each record in the file after the M'th, the algorithm has to
decide if the record just read will be discarded or if it will replace
one of the M records already saved, and in the latter case, which of
those records will be replaced. If the algorithm is carefully designed,
then once the last (N'th) record of the file has been processed this
way, we have M records randomly selected from the N records, in such
a way that each of the N records had an equal probability of ending
up in the selection of M records. I may seek out the details if needed.
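One well-known way to fill in those details is reservoir sampling (often
called "Algorithm R"). The sketch below is in Python and is only an
illustration under my own assumptions: the function name
reservoir_sample and its parameters are invented for the example, and
I am not claiming this is what SPSS actually did. The key step is that
record number i (for i > M) is kept with probability M/i, and when kept
it replaces one of the M saved records chosen uniformly.

```python
import random


def reservoir_sample(records, m, rng=None):
    """Return at most m records drawn uniformly from an iterable.

    The input is read once, sequentially, and its length need not be
    known in advance. (Hypothetical helper, for illustration only.)
    """
    rng = rng or random.Random()
    sample = []
    for i, record in enumerate(records, start=1):
        if i <= m:
            # The first m records fill the sample directly.
            sample.append(record)
        else:
            # Draw j uniformly from 0 .. i-1. With probability m/i it
            # falls below m, and record i then replaces saved record j.
            j = rng.randrange(i)
            if j < m:
                sample[j] = record
    return sample


# Usage: any sequential source works, e.g. the lines of an open file
# or rows fetched one at a time from a database cursor.
picked = reservoir_sample(range(1000), 5)
```

If the source holds M records or fewer, the function simply returns all
of them, which matches the N <= M case described above.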
This is my suggestion, or in fact, more a thought than a suggestion. It
might represent something useful either for flat ASCII files or even for
a stream of records coming out of a database, if those indeed do not
offer ready-made random sampling devices.
P.S. - In the (rather unlikely, I admit) case that the gang I'm part of
had the need described above, and if I then dared to implement it
myself, would it be welcome?
--
François Pinard http://pinard.progiciels-bpi.ca