[R] Suggestion for big files [was: Re: A comment about R:]

hadley wickham h.wickham at gmail.com
Sun Jan 8 22:42:37 CET 2006


> Thanks as well for these hints.  Googling around as you suggested (yet
> keeping my eyes in the MySQL direction, because this is what we use),
> getting MySQL itself to do the selection looks a bit discouraging:
> according to the comments I've read, MySQL does not seem to scale well
> with the database size, especially when records have to be decorated
> with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you
need to think carefully about which operations are most common in your
database.  For example, the problem is much easier if you can assume
that the rows are numbered sequentially from 1 to n.  This could be
enforced using a trigger whenever a record is added/deleted.  This
would slow insertions/deletions but speed up selects.
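
Roughly, selecting the sample would then look something like this (a
sketch only, assuming a gap-free integer column "id" in a table
"mytable" and an existing DBI/RMySQL connection "con"; all three names
are made up for illustration):

library(DBI)                        # plus a driver such as RMySQL
# con <- dbConnect(...)             # connection details omitted
n   <- dbGetQuery(con, "SELECT COUNT(*) FROM mytable")[1, 1]
ids <- sample(n, 10)                # 10 distinct row numbers in 1..n
qry <- paste("SELECT * FROM mytable WHERE id IN (",
             paste(ids, collapse = ","), ")")
smp <- dbGetQuery(con, qry)         # the random sample of records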

> Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).

This is another example where greater knowledge of the problem can
yield speed increases.  Here (where the number of selections is much
smaller than the total number of objects) you are better off
generating 10 numbers with runif(10, 0, 100000000) and then checking
that they are unique.
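
A minimal sketch of that idea (rounding the uniform draws up to
integers and redrawing in the unlikely event of a duplicate):

n <- 100000000                      # total number of records
k <- 10                             # sample size wanted
repeat {
  ids <- ceiling(runif(k, 0, n))    # k random integers in 1..n
  if (!any(duplicated(ids))) break  # with k << n, duplicates are rare
}
ids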

> >Another possibility is to generate a large table of randomly
> >distributed ids and then use that (with randomly generated limits) to
> >select the appropriate number of records.
>
> I'm not sure I understand your idea (what confuses me is the "randomly
> generated limits" part).  If the "large table" is much larger than the
> size of the wanted sample, we might not be gaining much.

Think about using a table of random numbers.  They are pregenerated
for you; you just choose a starting and ending index.  It will be slow
to generate the table the first time, but then it will be fast.  It
will also take up quite a bit of space, but space is cheap (and time
is not!).
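
In R terms the idea is roughly this (a sketch; the pregenerated table
"perm" could just as well live in a MySQL table that you index into):

n    <- 1000000                       # number of records (illustrative)
perm <- sample(n)                     # one-off: a pregenerated random ordering
# later, each sample is just a cheap slice:
k     <- 10
start <- sample(n - k + 1, 1)         # random starting index
ids   <- perm[start:(start + k - 1)]  # k record ids, already in random order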

Hadley



