[R-sig-hpc] creating many separate streams (more streams than nodes)

Paul Johnson pauljohn32 at gmail.com
Tue Apr 19 18:21:33 CEST 2011


Hi, everybody:

I sat down to write a quick question, and it turned into a very
complicated mess.  I will give you the blunt question first, and then
I'll leave all those other words I had written so that future Googlers
might learn from my experience, or so you might understand me better.

I want to spawn 8000 random generator streams and save their initial
states. Then I want to be able to re-run simulations against those
generators and get the same results, whether or not I'm on just one
machine or on a cluster.

The treatment of random seeds in packages like SNOW or Rmpi is aimed
at initializing the cluster nodes, rather than at initializing the
separate streams.  My cluster has 60 nodes, so the 60 separate
streams from SNOW will not suffice.  So I'm reading code in rlecuyer
and rstream to see how this might be done.  My background in
agent-based simulation gives me some experience with serialization of
PRNG objects, but R makes it very tough to see through the veil:
there are hidden global objects, like the base generator, and the
interface to them requires considerable care.

Now the "blah blah blah". After hacking on this for a couple of hours,
I think it boils down to 4 points.

1. John Chambers seems to agree this would be a good idea.

I got the idea that my plan should be do-able from John Chambers'
book, Software for Data Analysis, where he discusses replication of
simulations. For replication, he says one should save the state of the
generator with each stream that is drawn, then use that state to
re-instantiate identical streams in the future.  I have made that work
on a single system using the scheme he describes, which basically
involves grabbing .Random.seed and saving it.  If you look in the R
code for the lsoda package (the one Chambers discusses), you will see
it is not more complicated than saving .Random.seed like this:

drawSample <- function(N=NULL, b0=NULL, b1=NULL, pm=NULL, setseed=NULL){
  if (!is.null(setseed)) {
    set.seed(setseed)
    cat("seed NOT NULL\n")
  }
  inState <- .Random.seed        # save the generator state before drawing
  x <- rnorm(N, mean=0, sd=1)
  e <- rnorm(N, mean=0, sd=sqrt(1-b1*b1))
  y <- b0 + b1 * x + e
  mm <- missingMask(N, pm)       # project-specific helper
  s1 <- data.frame(y, x)
  dat <- imposeMissings(s1, mm)  # project-specific helper
  list("dat"=dat, "inState"=inState)
}

This is from an "actual project" I'm working on; please ignore the x
and y part.  Just note that it uses a seed if one is given, and that
it saves the generator state in inState.

I've verified that this saves the state, and that if I pass the saved
state back in again, I can replicate the sample exactly, when working
on exactly one system.  As long as we run the project 8000 times in a
row, we can save those seed states and re-create the samples later.
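Stripped of the project-specific parts, the save-and-restore idea is
just this (a minimal standalone sketch, not the project code):

```r
## Snapshot .Random.seed before a draw, then assign it back
## later to replay the identical draw.
set.seed(12345)
inState <- .Random.seed   # snapshot the generator state

x1 <- rnorm(5)            # draw a sample

.Random.seed <- inState   # restore the state in the workspace
x2 <- rnorm(5)            # draw again from the restored state

identical(x1, x2)         # TRUE: the sample is replayed exactly
```

The key point is that .Random.seed is an ordinary R vector once you
grab it, so it can be stored in a list, written to disk, and shipped
around like any other object.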

Now, go parallel. Humphf.  Recently a poster on r-sig-hpc said that
these parallel generator packages seem to be not quite done, and I
think there was some truth in that.


2. rstream (by L'Ecuyer and Leydold) has a very promising writeup

P. L’Ecuyer and J. Leydold (2005): rstream: Streams of Random Numbers
for Stochastic Simulation, R News 5(2), 16–20.

They say that rstream 1) does exactly what I want, and 2) is, they
claim, faster and more efficient than rlecuyer,

but 3) the saving and re-creation of the streams seems to be fraught with peril:

Warning
The underlying RngStreams library uses a global variable to store the
package seed. This variable is also stored inside R. Whenever a new
instance of a "rstream.lecuyer" object is created, the value of the
global variable is set to the value of the R object. Thus there is no
problem when such "rstream.lecuyer" objects are packed for use in
later R sessions. However, if such packed objects are not stored in
the workspace image, then the R variable gets lost and there is an
(extremely small) chance that newly created objects are not
stochastically independent from restored objects.

Huh?  That's quite horrible, really.  Until that problem is solved and
the warning is eliminated from the documentation, I can't say I'd ever
have any confidence.
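For what it's worth, the packing mechanism they describe looks roughly
like this.  This is a sketch based on my reading of the rstream
documentation, assuming the package is installed; the class and
function names follow their writeup, but I may have details wrong:

```r
library(rstream)  # CRAN package by L'Ecuyer and Leydold

## create a stream object backed by the MRG32k3a generator
s <- new("rstream.mrg32k3a", seed = rep(12345, 6))
x1 <- rstream.sample(s, 5)   # draw 5 numbers from this stream

## "pack" the stream so it is safe to serialize with save();
## this is the step the warning above is about
rstream.packed(s) <- TRUE

## later (possibly in another session): unpack and keep drawing
rstream.packed(s) <- FALSE
x2 <- rstream.sample(s, 5)
```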


3. The state of the rsprng package in R is not good, mainly because
the state of the underlying C++ library is not good.

I'd have to fix up rsprng if I wanted to keep going down my current
path with the MT19937 generator, of course: the only open-source
implementation of the multiple-streams-from-MT19937 approach is in
SPRNG and the R package rsprng.  If you have looked at the code in
SPRNG 2.0, you will know that it is not in good condition. It does
not compile cleanly with current GCC.  Just getting the C++ part to
compile requires quite a bit of futzing about
(https://stat.ethz.ch/pipermail/r-sig-hpc/2010-January/000527.html).
The Fortran examples appear to be completely unworkable.  The SPRNG
team has made several updates since rsprng was written, and they are
not compatible with rsprng.  Conclusion: the state of the code is just
not good.

I've read that the MT19937 authors have developed their own
implementation of the idea behind SPRNG: spawning separate MT19937
generators with unique 623-dimensional states so that they form
distinct streams.  However, as far as I can tell, there's no public
code.  The Nvidia corporation is working on a commercial
implementation of that same idea for their GPUs.

4. The many-streams approach described here:

P. L'Ecuyer, R. Simard, E. J. Chen and W. D. Kelton: An Object-Oriented
Random-Number Package with Many Long Streams and Substreams;
Operations Research, vol. 50, no. 6, 2002.

is still considered valid. The idea there is to take the very long
stream of random numbers and carve streams out of it in contiguous
chunks.  Unlike MT19937, the backbone generator, MRG32k3a, allows one
to grab the starting points of those separate chunks without an
inordinate amount of computation.
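As a sketch of the chunking idea: base R (in more recent versions than
I am running, via the parallel package) exposes this same MRG32k3a
skip-ahead as nextRNGStream(), which jumps 2^127 draws per call.  The
8000 starting states can be collected in an ordinary list:

```r
## Collect 8000 independent stream starting states by skipping
## ahead 2^127 steps per stream in MRG32k3a.
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # switch base R to the MRG32k3a generator
set.seed(12345)

nStreams <- 8000
streams <- vector("list", nStreams)
s <- .Random.seed
for (i in seq_len(nStreams)) {
  streams[[i]] <- s        # starting state of stream i
  s <- nextRNGStream(s)    # jump ahead 2^127 steps
}

## replay stream 42: restore its saved state and draw twice
.Random.seed <- streams[[42]]
x1 <- rnorm(5)
.Random.seed <- streams[[42]]
x2 <- rnorm(5)
identical(x1, x2)          # TRUE
```

Each saved state is just an integer vector, so the list of 8000 can be
saved to disk and reused whether the simulations run on one machine or
are scattered across a cluster.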

One implementation of that idea within R is found in the rlecuyer
package, which is the most prominent package for parallel random
numbers (Hana Sevcikova and Tony Rossini have been contributors here).
Since rlecuyer is built right on top of the C code by L'Ecuyer, and
the R code itself is quite clear, I think I may be able to adapt it to
the purpose.
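From reading the rlecuyer sources, it looks like the dotted functions
can be driven directly with named streams.  A sketch, assuming
rlecuyer is installed; the function names come from the package but I
may have the calling details slightly wrong:

```r
library(rlecuyer)

## seed the package (six integers for MRG32k3a), then create
## named streams; the names let us come back to a stream later
.lec.SetPackageSeed(rep(12345, 6))
.lec.CreateStream(paste0("stream", 1:10))  # could just as well be 1:8000

## make "stream3" the current generator, draw, then restore
## whatever generator was active before
old.kind <- .lec.CurrentStream("stream3")
x1 <- rnorm(5)
.lec.CurrentStreamEnd(old.kind)

## rewind "stream3" to its starting point and replay the draw
.lec.ResetStartStream("stream3")
old.kind <- .lec.CurrentStream("stream3")
x2 <- rnorm(5)
.lec.CurrentStreamEnd(old.kind)
identical(x1, x2)   # same stream replayed from its start
```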

My fear about the rlecuyer package is that all of the methods start
with periods, which means the authors probably don't want idiots
like me digging about in there.

Then I found rstream, which appears to have the right idea for me,
except that the saving and restoration of the stream objects is, well,
frightening.

I'll keep reading, and I hope some of you will email about it.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas


