[R] Suggestion for big files [was: Re: A comment about R:]

Sun Jan 8 19:47:07 CET 2006

[hadley wickham]

>[François Pinard]

>> Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
>> easily selecting a _random_ sample of the records of a given table.
>> On the other hand, I'm no SQL specialist, others might know better.

>There are a number of such devices, which tend to be rather SQL variant
>specific.  Try googling for select random rows mysql, select random
>rows pgsql, etc.

Thanks as well for these hints.  Googling around as you suggested (yet 
keeping my eyes in the MySQL direction, because this is what we use), 
I find that getting MySQL itself to do the selection is a bit 
discouraging: according to the comments I've read, MySQL does not seem 
to scale well with the database size, especially when records have to 
be decorated with random numbers and later sorted.

Yet, I did not run any benchmark myself, and would not blindly take 
everything I read for granted, given that MySQL developers have speed in 
mind, and there are ways to interrupt a sort before running it to full 
completion, when only a few sorted records are wanted.
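
To illustrate the "interrupted sort" idea, here is a minimal sketch in 
Python (not R, not SQL).  It decorates each record with a random number 
and keeps only the k smallest keys through a bounded heap, so the full 
data set never gets sorted.  The function name and the seed parameter 
are made up for the illustration, and this says nothing about what 
MySQL actually does internally.

```python
import heapq
import random


def sample_by_random_keys(records, k, seed=None):
    """Return k records chosen uniformly at random, without a full sort.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Decorate each record with a random key.  The index breaks ties, so
    # the records themselves are never compared with each other.
    decorated = ((rng.random(), i, rec) for i, rec in enumerate(records))
    # nsmallest keeps at most k candidates in a heap: roughly O(n log k)
    # work instead of the O(n log n) of a complete sort.
    smallest = heapq.nsmallest(k, decorated)
    return [rec for _key, _i, rec in smallest]
```

Every record still has to be read once and given a random number, so 
the cost stays linear in the table size; only the sorting part shrinks.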

>Another possibility is to generate a large table of randomly
>distributed ids and then use that (with randomly generated limits) to
>select the appropriate number of records.

I'm not sure I understand your idea (what confuses me is the "randomly 
generated limits" part).  If the "large table" is much larger than the 
size of the wanted sample, we might not be gaining much.

Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).

All in all, if I ever have such a problem, a practical solution probably 
has to be outside of R, and maybe outside SQL as well.
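
One candidate for such an outside solution is reservoir sampling, which 
draws a uniform sample of k items in a single pass over a stream of 
unknown length, holding only k items in memory.  A minimal sketch in 
Python follows; the function name and the seed parameter are made up 
for the illustration, and I am assuming the records can be read 
sequentially (lines of a dumped file, say).

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration only (classic "Algorithm R").
    """
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number n replaces a random slot with probability k/n.
            j = rng.randrange(n)  # uniform over 0 .. n-1
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing an open file object as the stream would sample lines of that 
file; if the stream holds fewer than k items, all of them come back.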

-- 
François Pinard   http://pinard.progiciels-bpi.ca