[R] Suggestion for big files [was: Re: A comment about R:]
pinard at iro.umontreal.ca
Sun Jan 8 23:51:23 CET 2006
>> [...] according to the comments I've read, MySQL does not seem to
>> scale well with the database size, especially when records have to be
>> decorated with random numbers and later sorted.
>With SQL there is always a way to do what you want quickly, but you
>need to think carefully about what operations are most common in your
>database. For example, the problem is much easier if you can assume
>that the rows are numbered sequentially from 1 to n. This could be
>enforced using a trigger whenever a record is added/deleted. This would
>slow insertions/deletions but speed selects.
Granted (to take a caricature example), if database records are already
decorated with random numbers, and an index is built over that
decoration, random sampling may indeed be done more quickly :-). The
fact is that (at least our) databases are not specially designed for
random sampling, and the people in charge would resist redesigning them
merely to serve a few occasional needs for random sampling.
What would be ideal is being able to build random samples out of any big
database or file with equal ease. The fact is that it is doable.
(Brian Ripley points out that R's textual I/O has too much overhead to
be usable here, so one should rather say, sadly: "It's doable outside R".)
>> Just for fun: here, "sample(100000000, 10)" in R is slowish already
>This is another example where greater knowledge of problem can yield
>speed increases. Here (where the number of selections is much smaller
>than the total number of objects) you are better off generating 10
>numbers with runif(10, 0, 100000000) and then checking that they are
>unique.
Of course, my remark about "sample()" relates to the previous
discussion. If "sample(N, M)" were closer to O(M) than to O(N) (both
memory-wise and CPU-wise), it could be used to preselect which rows of a
big database to include in a random sample, building on your idea of
using a set of IDs. Since the sample of M records will have to be
processed in memory by R anyway, computing a vector of M indices does
not (or should not) increase the overall complexity. However,
"sample(N, M)" is much less usable for randomly sampling a database if
it is O(N) to start with. As for your suggestion of using "runif" and
later checking uniqueness: "sample()" itself could well be implemented
that way whenever the arguments permit. The "greater knowledge of the
problem" could be built right into the routine meant to solve it.
"sample(N, M)" could even know how to take advantage of some simplified
case of a "reservoir sampling" technique :-).
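Both techniques alluded to above are short to sketch (again in Python,
just to keep the algorithmic point language-neutral; function names are
mine). Floyd's algorithm draws M distinct indices in O(M) expected time
and memory, which is the behavior one could wish "sample(N, M)" had when
M is tiny next to N; reservoir sampling does one sequential pass in O(M)
memory, which suits a big file whose length is not even known in advance:

```python
import random

def floyd_sample(n, m):
    """Draw m distinct integers from 1..n in O(m) expected time and
    memory (Robert Floyd's algorithm), independent of n."""
    chosen = set()
    for j in range(n - m + 1, n + 1):
        t = random.randint(1, j)
        # If t was already picked, j itself cannot have been, so take j.
        chosen.add(t if t not in chosen else j)
    return chosen

def reservoir_sample(stream, m):
    """One-pass uniform sample of m items from a stream of unknown
    length, keeping only m items in memory at any time."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < m:
            reservoir.append(item)
        else:
            # Keep each prefix uniformly sampled: item i replaces a
            # reservoir slot with probability m / (i + 1).
            j = random.randint(0, i)
            if j < m:
                reservoir[j] = item
    return reservoir
```

For instance, `floyd_sample(100000000, 10)` returns instantly, whatever
the first argument, whereas an O(N) implementation would not.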
>> >[...] a large table of randomly distributed ids [...] (with randomly
>> >generated limits) to select the appropriate number of records.
>[...] a table of random numbers [...] pregenerated for you, you just
>choose a starting and ending index. It will be slow to generate the
>table the first time, but then it will be fast. It will also take up
>quite a bit of space, but space is cheap (and time is not!)
Thanks for the explanation.
In the case under consideration here (random sampling of a big file or
database), I would be tempted to guess that the time required for
generating pseudo-random numbers is negligible compared to the overall
input/output time, so pregenerating randomized IDs might not be worth
the trouble, especially since whenever the database size changes, the
list of pregenerated IDs becomes invalid.
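For completeness, the pregenerated-table scheme quoted above can be
sketched in a few lines (Python again, names mine): shuffle all ids once,
then every later sample is just a contiguous slice at a random start.
Note that successive samples taken this way are not independent of one
another, and, as said above, the whole table must be rebuilt whenever n
changes:

```python
import random

n = 100000
# Pregenerate once: slow and space-hungry, as noted in the quoted text,
# but paid only once for a database of fixed size n.
shuffled_ids = list(range(1, n + 1))
random.shuffle(shuffled_ids)

def quick_sample(m):
    """Each later sample is just a slice at a random starting index."""
    start = random.randint(0, n - m)
    return shuffled_ids[start:start + m]

ids = quick_sample(10)
```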
François Pinard http://pinard.progiciels-bpi.ca