[R] Suggestion for big files [was: Re: A comment about R:]
pinard at iro.umontreal.ca
Sun Jan 8 19:47:07 CET 2006
>> Selecting a sample is easy. Yet, I'm not aware of any SQL device for
>> easily selecting a _random_ sample of the records of a given table.
>> On the other hand, I'm no SQL specialist, others might know better.
>There are a number of such devices, which tend to be rather SQL variant
>specific. Try googling for select random rows mysql, select random
>rows pgsql, etc.
Thanks as well for these hints. Googling around as you suggested (though
keeping my eyes in the MySQL direction, since that is what we use),
getting MySQL itself to do the selection looks a bit discouraging:
according to the comments I've read, it does not seem to scale well with
the database size, especially when records have to be decorated with
random numbers and then sorted.
Still, I did not run any benchmark myself, and would not blindly take
everything I read for granted, given that the MySQL developers have speed
in mind, and there are ways to interrupt a sort before it runs to full
completion when only a few sorted records are wanted.
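For reference, the idiom usually criticized is ORDER BY RAND(): the server
tags every row with a random number and sorts the whole table, only to keep
a handful of rows, which is O(N log N) in the table size. A minimal sketch
using Python's built-in sqlite3 (where the function is RANDOM() rather than
MySQL's RAND(); the table and column names here are made up for
illustration):

```python
import sqlite3

# In-memory toy database standing in for a large MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, "row-%d" % i) for i in range(1, 10001)],
)

# The much-discussed idiom: every row gets a random sort key, then the
# whole table is sorted, only to keep 10 rows.
sample = conn.execute(
    "SELECT id FROM records ORDER BY RANDOM() LIMIT 10"
).fetchall()
print(len(sample))
```

The sample is drawn without replacement, but at the cost of a full scan
and sort of the table on every query.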
>Another possibility is to generate a large table of randomly
>distributed ids and then use that (with randomly generated limits) to
>select the appropriate number of records.
I'm not sure I understand your idea (what confuses me is the "randomly
generated limits" part). If the "large table" is much larger than the
size of the wanted sample, we might not be gaining much.
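One reading of the suggestion, assuming the ids are dense integers with no
large gaps, is to generate the random ids on the client side and fetch just
those rows by primary key, avoiding the full sort. A sketch of that
interpretation (again with an illustrative sqlite3 stand-in table):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, "row-%d" % i) for i in range(1, 10001)],
)

# Draw 10 distinct random ids client-side, then fetch only those rows.
n_rows = conn.execute("SELECT MAX(id) FROM records").fetchone()[0]
wanted = random.sample(range(1, n_rows + 1), 10)
placeholders = ",".join("?" * len(wanted))
rows = conn.execute(
    "SELECT id, payload FROM records WHERE id IN (%s)" % placeholders,
    wanted,
).fetchall()
print(len(rows))
```

This only works cleanly when ids are contiguous; with gaps or deletions,
missed ids would have to be retried or the sample size adjusted.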
Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).
All in all, if I ever have such a problem, a practical solution probably
has to be outside of R, and maybe outside SQL as well.
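One classic outside-of-SQL solution is reservoir sampling: a single
sequential pass over the records keeps a uniform sample of fixed size k,
whatever the total number of records, with no sorting and no need to know
the record count in advance. A sketch of Algorithm R, with a plain range of
integers standing in for a stream of database records:

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return k items drawn uniformly from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep each later item with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# 10 records sampled uniformly from a stream of 100000.
sample = reservoir_sample(range(100000), 10)
print(sample)
```

Memory use is O(k) regardless of the stream length, which makes this
approach practical for tables too large to sort or even to hold in memory.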
François Pinard http://pinard.progiciels-bpi.ca