[R] Suggestion for big files [was: Re: A comment about R:]
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jan 6 09:33:05 CET 2006
>>>>> "FrPi" == François Pinard <pinard at iro.umontreal.ca>
>>>>> on Thu, 5 Jan 2006 22:41:21 -0500 writes:
FrPi> [Brian Ripley]
>> I rather thought that using a DBMS was standard practice in the
>> R community for those using large datasets: it gets discussed rather
>> often.
FrPi> Indeed. (I tried RMySQL even before speaking of R to my co-workers.)
>> Another possibility is to make use of the several DBMS interfaces already
>> available for R. It is very easy to pull in a sample from one of those,
>> and surely keeping such large data files as ASCII is not good practice.
FrPi> Selecting a sample is easy. Yet, I'm not aware of any
FrPi> SQL device for easily selecting a _random_ sample of
FrPi> the records of a given table. On the other hand, I'm
FrPi> no SQL specialist, others might know better.
FrPi> We do not have a need yet for samples where I work,
FrPi> but if we ever need such, they will have to be random,
FrPi> or else, I will always fear biases.
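For what it is worth, several SQL engines can return rows in random order, which gives a simple (if not always fast) random sample. The sketch below uses Python's built-in sqlite3 module and SQLite's RANDOM() function; the table name `records` and its columns are made up for illustration. MySQL spells the function RAND() and PostgreSQL RANDOM(). Note that this sorts the whole table, so it may be slow on very large tables.

```python
import sqlite3

# Hypothetical in-memory table, used only to make the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO records (value) VALUES (?)",
    [(x * 0.5,) for x in range(1000)],
)

# Shuffle all rows with RANDOM() and keep the first 10 of them.
rows = conn.execute(
    "SELECT id, value FROM records ORDER BY RANDOM() LIMIT 10"
).fetchall()

for row_id, value in rows:
    print(row_id, value)
```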
>> One problem with Francois Pinard's suggestion (the credit has got lost)
>> is that R's I/O is not line-oriented but stream-oriented. So selecting
>> lines is not particularly easy in R.
FrPi> I understand that you mean random access to lines,
FrPi> instead of random selection of lines. Once again,
FrPi> this chat comes out of reading someone else's problem,
FrPi> this is not a problem I actually have. SPSS was not
FrPi> randomly accessing lines, as data files could well be
FrPi> held on magnetic tapes, where random access is not
FrPi> possible in ordinary practice. SPSS reads (or was
FrPi> reading) lines sequentially from beginning to end, and
FrPi> the _random_ sample is built as the reading proceeds.
FrPi> Suppose the file (or tape) holds N records (N is not
FrPi> known in advance), from which we want a sample of M
FrPi> records at most. If N <= M, then we use the whole
FrPi> file; sampling is neither possible nor necessary.
FrPi> Otherwise, we first initialise M records with the
FrPi> first M records of the file. Then, for each record in
FrPi> the file after the M'th, the algorithm has to decide
FrPi> if the record just read will be discarded or if it
FrPi> will replace one of the M records already saved, and
FrPi> in the latter case, which of those records will be
FrPi> replaced. If the algorithm is carefully designed,
FrPi> then once the last (N'th) record of the file has been
FrPi> processed this way, we have M records randomly
FrPi> selected from the N records, in such a way that
FrPi> each of the N records had an equal probability of
FrPi> ending up in the selection of M records. I can look
FrPi> up the details if needed.
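One well-known scheme that fits this description is reservoir sampling (often called "Algorithm R"). Whether it is what SPSS actually used is not stated here, so the following is only a minimal sketch in Python; the function name and parameters are chosen for illustration. The k'th record read (counting from 1, with k > M) is kept with probability M/k, and if kept it replaces a uniformly chosen slot among the M saved records.

```python
import random


def reservoir_sample(records, m, rng=None):
    """Return up to m records drawn uniformly at random from an iterable.

    The input is read once, front to back, and its length need not be known.
    """
    rng = rng or random.Random()
    sample = []
    for i, record in enumerate(records):  # i is 0-based
        if i < m:
            # The first m records fill the reservoir.
            sample.append(record)
        else:
            # j is uniform on 0..i inclusive, so j < m with probability m/(i+1).
            j = rng.randint(0, i)
            if j < m:
                # Keep the new record, overwriting the saved record in slot j.
                sample[j] = record
    return sample


if __name__ == "__main__":
    print(reservoir_sample(range(1, 1001), 10))
```

If the input holds M records or fewer, the function simply returns all of them, which matches the "use the whole file" case.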
FrPi> This is my suggestion, or in fact, more a thought than
FrPi> a suggestion. It might represent something useful
FrPi> either for flat ASCII files or even for a stream of
FrPi> records coming out of a database, if those effectively
FrPi> do not offer ready random sampling devices.
FrPi> P.S. - In the (rather unlikely, I admit) case the gang
FrPi> I'm part of would have the need described above, and
FrPi> if I then dared to implement it myself, would it be welcome?
I think this would be a very interesting tool and
I'm also intrigued about the details of the algorithm you
outline above.
If it could be made to work on all kinds of read.table()-readable
files (including *.csv, of course), that might be a valuable
tool for all those -- and there are many -- for whom working
with a DBMS is too daunting initially.
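To make the idea concrete for delimited files, here is a rough sketch in Python (not an existing R facility) of a one-pass random sample of a CSV file that keeps the header line. The function name `sample_csv` and its arguments are hypothetical. It assumes the first record is a header and uses the standard reservoir replacement rule, where record k > m is kept with probability m/k.

```python
import csv
import random


def sample_csv(src_path, dst_path, m, seed=None):
    """Write the header plus a uniform random sample of at most m data rows.

    Reads src_path once, sequentially; returns the number of rows written.
    """
    rng = random.Random(seed)
    kept = []  # (row_number, row) pairs currently in the sample
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)  # assumed to be a header record
        for i, row in enumerate(reader):
            if i < m:
                kept.append((i, row))
            else:
                j = rng.randint(0, i)  # uniform on 0..i inclusive
                if j < m:
                    kept[j] = (i, row)
    kept.sort(key=lambda pair: pair[0])  # restore original file order
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        writer.writerows(row for _, row in kept)
    return len(kept)
```

The reduced file could then be read with read.table() or read.csv() as usual. Using a CSV parser rather than raw lines means quoted fields containing line breaks are treated as part of a single record.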
Martin Maechler, ETH Zurich