[R] Suggestion for big files [was: Re: A comment about R:]

François Pinard pinard at iro.umontreal.ca
Sun Jan 8 23:51:23 CET 2006


[hadley wickham]

>> [...] according to the comments I've read, MySQL does not seem to
>> scale well with the database size, especially when records have to be
>> decorated with random numbers and later sorted.

>With SQL there is always a way to do what you want quickly, but you
>need to think carefully about what operations are most common in your
>database.  For example, the problem is much easier if you can assume
>that the rows are numbered sequentially from 1 to n.  This could be
>enforced using a trigger whenever a record is added/deleted.  This would
>slow insertions/deletions but speed selects.

Sure, to take a caricature example: if database records are already 
decorated with random numbers, and an index is built over that 
decoration, random sampling may indeed be done more quickly :-).  The 
fact is that (at least our) databases are not especially designed for 
random sampling, and the people in charge would resist redesigning them 
merely to serve a few occasional needs for random sampling.

What would be ideal is being able to build random samples out of any big 
database or file, with equal ease.  The fact is that it's doable.  
(Brian Ripley points out that R textual I/O has too much overhead to be 
usable for this, so one should rather say, sadly: "It's doable outside R".)

>> Just for fun: here, "sample(100000000, 10)" in R is slowish already
>> :-).

>This is another example where greater knowledge of the problem can yield
>speed increases.  Here (where the number of selections is much smaller
>than the total number of objects) you are better off generating 10
>numbers with runif(10, 0, 1000000) and then checking that they are
>unique.

Of course, my remark about "sample()" is related to the previous 
discussion.  If "sample(N, M)" were more on the O(M) side than on the 
O(N) side (both memory-wise and CPU-wise), it could be used for 
preselecting which rows of a big database to include in a random sample, 
so building on your idea of using a set of IDs.  As the sample of 
M records will have to be processed in-memory by R anyway, computing 
a vector of M indices does not (or should not) increase complexity.
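Just to make the idea concrete, here is a rough sketch of such a 
preselection, assuming the M row numbers are already in hand from 
"sample(N, M)" or some cheaper equivalent.  The file name "bigfile.txt", 
the value of N and the variable names are placeholders of mine, and 
Brian's caveat about the overhead of R textual I/O of course still 
applies:

    ## Illustrative sketch only: keep the M preselected lines while
    ## making a single pass over a big text file of N records.
    N <- 1000000                      # total number of records (assumed known)
    M <- 10                           # sample size
    wanted <- sort(sample(N, M))      # M preselected row numbers
    con <- file("bigfile.txt", "r")   # placeholder file name
    rows <- character(M)
    k <- 1
    for (i in 1:N) {
        line <- readLines(con, n = 1)
        if (k <= M && i == wanted[k]) {
            rows[k] <- line
            k <- k + 1
        }
    }
    close(con)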

However, "sample(N, M)" is likely less usable for randomly sampling 
a database, if it is O(N) to start with.  As for your suggestion of 
using "runif" and later checking uniqueness, "sample()" could well be 
implemented this way when the arguments allow it.  The "greater 
knowledge of the problem" could be built right into the routine meant 
to solve it.  "sample(N, M)" could even know how to take advantage of 
some simplified case of a "reservoir sampling" technique :-).
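Purely to illustrate what I have in mind, and certainly not how the real 
"sample()" is implemented, here are two little sketches of mine (the 
function names are made up): the first draws M indices with "runif" and 
redraws duplicates, which stays close to O(M) work when M is much 
smaller than N; the second is plain reservoir sampling over a 
connection, for when N is not even known in advance.

    ## Sketch 1: sampling without replacement via rejection of
    ## duplicates (only sensible when M is much smaller than N).
    sample.small <- function(N, M) {
        ids <- unique(floor(runif(M, 0, N)) + 1)    # integers in 1..N
        while (length(ids) < M)
            ids <- unique(c(ids, floor(runif(M - length(ids), 0, N)) + 1))
        ids
    }

    ## Sketch 2: reservoir sampling ("Algorithm R") of M lines from a
    ## connection whose total length is unknown.
    sample.reservoir <- function(con, M) {
        reservoir <- readLines(con, n = M)          # first M lines
        i <- length(reservoir)
        repeat {
            line <- readLines(con, n = 1)
            if (length(line) == 0) break            # end of input
            i <- i + 1
            j <- floor(runif(1, 0, i)) + 1          # uniform on 1..i
            if (j <= M) reservoir[j] <- line
        }
        reservoir
    }

(For a file, one would pass something like file("bigfile.txt", "r") to 
the second sketch and close the connection afterwards.)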

>> >[...] a large table of randomly distributed ids [...] (with randomly
>> >generated limits) to select the appropriate number of records.

>[...] a table of random numbers [...] pregenerated for you, you just
>choose a starting and ending index.  It will be slow to generate the
>table the first time, but then it will be fast.  It will also take up
>quite a bit of space, but space is cheap (and time is not!)

Thanks for the explanation.

In the case under consideration here (random sampling of a big file or 
database), I would be tempted to guess that the time required for 
generating pseudo-random numbers is negligible when compared to the 
overall input/output time, so it might be that pregenerating randomized 
IDs is not worth the trouble, all the more so given that whenever the 
database size changes, the list of pregenerated IDs is no longer valid.
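If one wanted a quick and admittedly rough check of that guess, 
something like this would do (the file name is a mere placeholder, and 
the counts would need adjusting to the real data):

    ## Compare the cost of drawing a million random indices with the
    ## cost of a single textual pass over a big (placeholder) file.
    system.time(ids <- floor(runif(1e6, 0, 1e8)) + 1)
    system.time(nlines <- length(readLines("bigfile.txt")))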

-- 
François Pinard   http://pinard.progiciels-bpi.ca



