[R-sig-hpc] Quick rsprng questions

Thu Jul 30 23:20:44 CEST 2009

I'm far from an authority, but I'll try to answer.

On Thu, 2009-07-30 at 16:35 -0400, Thomas Hampton wrote:
> Hello Ross,
> 
> We recently installed rsprng on our beowulf cluster. My recollection
> was that R routines like sample() gave weird results before we did
> this -- namely, you could sample as many times as you like and you
> would get the same result, as if you were setting the seed to some
> fixed value (even when you did not).
> 
> I am not observing this behavior now.
> 
> My questions are these.
> 
> First, is it a normal feature of clusters to show odd random number
> properties if you do not have something like rsprng in there?
I would not expect the behavior you described above.  The only
misbehavior that seems likely is that each node/process in the cluster
gets the same, or at least not fully independent, streams of random
numbers.
> 
> Second, if you install rsprng, does the problem just magically go
> away, or do you need to make special calls in your R code to take
> advantage of rsprng?
First, you need to initialize rsprng properly; second, you need to be
able to access it.

Unless something else initializes rsprng for you (e.g., snow provides
clusterSetupSPRNG), you need to do it yourself by calling init.sprng
with appropriate parameters (which include the total number of processes
and the rank of the process executing the initialization).  This will
create independent streams, one per process.
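
As a rough illustration of "one stream per process, picked by rank",
here is a sketch that does not use rsprng at all.  It swaps in the
L'Ecuyer-CMRG generator and nextRNGStream() from the parallel package,
so it assumes an R recent enough to ship that package.  The master seed,
the stream count, and the helper use_stream() are made up for the
example.

```r
RNGkind("L'Ecuyer-CMRG")   # a generator designed to be split into streams
set.seed(2009)             # master seed (assumed), the same on every process
nstream <- 4               # total number of processes (assumed)

# Build one RNG state per process by stepping from stream to stream.
states <- vector("list", nstream)
states[[1]] <- .Random.seed
for (i in seq_len(nstream - 1)) {
  states[[i + 1]] <- parallel::nextRNGStream(states[[i]])
}

# Hypothetical helper: what the process with a given rank (0-based)
# would do before drawing any numbers.
use_stream <- function(rank) {
  assign(".Random.seed", states[[rank + 1]], envir = globalenv())
  runif(3)
}

r0 <- use_stream(0)   # draws from the first stream
r1 <- use_stream(1)   # draws from the second stream
```

Everything runs in one R session here only to keep the sketch
self-contained; on a cluster each process would install just its own
state.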

The second issue is getting access to these random numbers.  The uniform
random number generator and anything derived from it should work.  I'm
not sure if the normal random number generator will use SPRNG or not; I
suspect it will.

If you're trying to access the random number stream from C code, it's
tricky.  There are more details on the web page I announced:
http://wiki.r-project.org/rwiki/doku.php?id=packages:cran:rsprng.
> 
> Finally, why (roughly) is random number generation different in the
> parallel environment to begin with?
In the simplest case, you might get the same random number stream in
each parallel process.  This means the extra runs are pointless and, if
you use them naively, you will think you have a much bigger sample than
you really do.
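
A small sketch of that simplest case, using only base R.  The two
"nodes" are simulated in a single session, and the helper draw_on_node()
and the seed value are invented for the illustration.

```r
# Hypothetical helper: what one process does if it starts from a given seed.
draw_on_node <- function(seed, n) {
  set.seed(seed)
  runif(n)
}

# Both "nodes" start from the same RNG state.
node1 <- draw_on_node(seed = 1, n = 1000)
node2 <- draw_on_node(seed = 1, n = 1000)

identical(node1, node2)    # TRUE: the second run added no new information

pooled <- c(node1, node2)
length(pooled)             # nominal sample size
length(unique(pooled))     # number of distinct draws actually obtained
```

The pooled vector looks twice as large as either run, but it cannot
contain more distinct values than a single run does.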

A more complex problem is that the random number streams could be
dependent, but in a more subtle way.

A simple strategy is to generate a list of random integers to serve as
seeds, ship a different seed to each process, and then set the seed in
each process.  This works with non-parallel RNGs and is probably good
enough in most cases (it's a popular move in the biostat dept here).  I
suspect there are some issues with it, though, because otherwise there'd
be no need for explicit parallel random number generators like SPRNG.
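
The seed-shipping strategy might look roughly like this in base R.  The
helper names make_seeds() and run_worker(), the master seed, and the
worker count are all assumptions for the sake of the example, and
lapply() stands in for whatever actually sends work to the nodes.

```r
# Hypothetical helper: draw one distinct integer seed per worker process.
make_seeds <- function(n_workers, master_seed) {
  set.seed(master_seed)   # makes the list of seeds itself reproducible
  # Sampling without replacement guarantees the seeds are all different.
  sample.int(.Machine$integer.max, n_workers)
}

# Hypothetical stand-in for the work one process does after it has
# received its seed.
run_worker <- function(seed, n) {
  set.seed(seed)
  runif(n)
}

seeds <- make_seeds(n_workers = 4, master_seed = 2009)

# On a cluster, each call would run on its own node.
draws <- lapply(seeds, run_worker, n = 5)
```

With snow, something like clusterApply(cl, seeds, run_worker, n = 5)
should do the shipping, given a cluster object cl.  Distinct seeds give
different streams, but nothing here guarantees the streams are
independent of each other.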

Ross
> 
> Thanks very much,
> 
> Tom