[R-sig-hpc] creating many separate streams (more streams than nodes)

Tue Apr 19 18:43:56 CEST 2011

Paul,

The functionality that you're describing is implemented in the snowFT 
package. By default, the function
performParallel (and the underlying clusterApplyFT) initializes one 
stream per replicate, and thus provides
reproducible results. The function that does the RNG initialization is 
called clusterSetupRNGstreamRepli.

Regarding your comment about the names of the rlecuyer functions: Yes, 
our original intention was to
keep those functions internal and let snow and snowFT to deal with them, 
but there is no reason why
you couldn't use them directly.

Hana

On 4/19/11 9:21 AM, Paul Johnson wrote:
> Hi, everybody:
>
> I sat down to write a quick question, and it is turning into a very
> complicated mess.  I will give you the blunt question. And then I'll
> leave all those other words I had written so that future Googlers
> might learn from my experience. Or  you might understand me better.
>
> I want to spawn 8000 random generator streams and save their initial
> states. Then I want to be able to re-run simulations against those
> generators and get the same results, whether or not I'm on just one
> machine or on a cluster.
>
> The treatment of random seeds in packages like SNOW or Rmpi is aimed
> at initializing the cluster nodes, rather than initializing the
> separate streams.  My cluster has 60 nodes, and so the 60 separate
> streams from SNOW will not suffice.  So I'm reading code in rlecuyer
> and rstream to see how this might be done.  My background in
> agent-based simulation gives me some experience with serialization of
> PRNG objects, but R makes it very tough to see through the veil
> because there are hidden global objects like the base generator and
> the interface to it requires considerable care.
>
> Now the "blah blah blah". After hacking on this for a couple of hours,
> I think it boils down to 4 points.
>
> 1. John Chambers seems to agree this would be a good idea.
>
> I got the idea that my plan should be do-able from John Chambers'
> book, Software for Data Analysis, where he discusses replication of
> simulations. For replication, he says one shold save the state of the
> generator with each stream that is drawn, then use that to
> re-instantiate identical streams in the future.  I have made that work
> on a single system using the scheme he describes, which basically
> involves grabbng .Random.seed and saving it.  If you look in the R
> code for the lsoda package (the one Chambers discusses), you will see
> it is not more complicated than saving .Random.seed like this.
>
> drawSample<- function(N=NULL, b0=NULL, b1=NULL, pm=NULL, setseed=NULL){
>    if (!is.null(setseed)){set.seed(setseed)<- setseed; print("seed
> NOT NULL \n \n \n \n \n")};
>    inState<- .Random.seed
>    x<- rnorm(N, m=0, sd=1)
>    e<- rnorm(N, m=0, sd=sqrt(1-b1*b1))
>    y<- b0 + b1 * x + e
>    mm<- missingMask(N, pm)
>    s1<- data.frame(y,x)
>    dat<- imposeMissings(s1, mm)
>    list("dat"=dat, "inState"=inState)
> }
>
> This is from an "actual project" I'm working on, please ignore the x
> and y part, just note it uses a seed object if given, and it saves the
> seed's inState.
>
> I've verified that saves the state, and if I pass that back in again,
> I can replicate the sample exactly, when working on exactly one
> system.  As long as we run the project 8000 times in a row, we can
> save those seed states and re-create them later.
>
> No, go parallel. Humphf.  Recently a poster in r-sigh-hpc said that
> these parallel generator packages seem to be not quite done, and I
> think there was some truth in that.
>
>
> 2. rstream (by L'Ecuyer and Leydold) has a very promising writeup
>
> P. L’Ecuyer and J. Leydold (2005): rstream: Streams of Random Numbers
> for Stochastic Simulation, R News 5(2), 16–20.
>
> They say rstream  is 1) exactly what I want to do,  2) they claim to
> be faster&  more efficient than rlecuyer,
>
> but 3) the saving and re-creation of the streams seems to be fraught with peril:
>
> Warning
> The underlying RngStreams library uses a global variable to store the
> package seed. This vari-
> able is also stored inside R. Whenever a new instance of a
> "rstream.lecuyer" object is created the
> value of global variable is set to the value of the R object. Thus
> there is no problem when such
> "rstream.lecuyer" objects are packed for using in later R sessions.
> However, if such packed objects
> are not stored in the workspace image, then the R variable gets lost
> and there is a (extremely small)
> chance that newly created objects are not stochastically independent
> from restored objects.
>
> Huh?  That's quite horrible, really.  Until that problem is solved and
> the warning is eliminated from the documentation, I can't say I'd ever
> have any confidence.
>
>
> 3. The state of the rsprng package in R is not good, mainly because
> the state of the underlying C++ library is not good.
>
> I'd have to fix up rsprng if I wanted to keep going along my current
> path with the  MT19937  generator, of course, and the only open source
> implementation of the multiple streams from MT19937 approach is in
> SPRNG and the R package rsprng.  If you have looked at the code in
> SPRNG 2.0, you will know that it is not in a good condition. It does
> not compile cleanly with current GCC.  Just getting the C++ part to
> compile requires quite a bit of futzing about
> (https://stat.ethz.ch/pipermail/r-sig-hpc/2010-January/000527.html).
> The Fortran examples appear to be completely unworkable.  The SPRNG
> team has made several updates since rsprng was written, and they are
> not compatible with rsprng.  Conclusion: the state of the code is just
> not good.
>
> I've read that the MT19937 authors have developed their own
> implementation of the idea behind SPRNG, to spawn separate MT19937
> generators with unique 632 dimensional states so that they are unique
> streams.  However, as far as I can tell, that's no public code.  The
> Nvidia corporation is working on a commercial implementation of that
> same idea for their GPUs.
>
> 4. The Many streams approach described here:
>
> P. L’Ecuyer, R. Simard, E.J.Chen and W.D.Kelton: An Object-Oriented
> Random-Number Package
> With Many Long Streams and Substreams; Operations Research, vol. 50,
> nr. 6, 2002.
>
> is still considered valid. The idea there is to take the very long
> stream of random numbers, and take Streams out of it in contiguous
> chunks.  Unlike MT19937, the backbone generator MRG232a, allows one to
> grab the starting points of those separate chunks without an
> inordinate amount of computation.
>
> One implementation of that idea within R is found in the rlecuyer
> package, which is the most prominent package for parallel random
> numbers (Hana Sevcikova&  Tony Rossini have been contributors here).
> Since rlecuyer is literally built right on top of the C code by
> L'Ecuyer, and the R code itself is quite clear, I think I may be able
> to adapt that to the purpose.
>
> My fear about the rlecuyer package is that all of the methods start
> with periods, which means that the author's probably don't want idiots
> like me digging about in there.
>
> Then I found rstream, which appears to have the right idea for me,
> except that the saving and restoration of the stream objects is, well,
> frightening.
>
> I'll keep reading and hoping that some of  you email about it.
>