[R-sig-hpc] parallel random numbers: set.seed(i), rsprng, rlecuyer, ??

Tue Dec 14 22:51:09 CET 2010

I have a particular variant of this question, and also some additional
information related to the original one.  See also the discussion at
http://rwiki.sciviews.org/doku.php?id=packages:cran:rsprng.

The question is how to generate reproducible streams of parallel random
numbers in a way that is insensitive to the number of nodes used.  If I
can run 10 jobs one time, and 60 the next, I want to get the same random
numbers.
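
One way to make the draws insensitive to the node count is to tie each
stream to a task index rather than to a worker.  What follows is only a
minimal sketch, not something rsprng or rlecuyer provides.  It assumes an
R version that ships the `parallel` package (R >= 2.14.0, which postdates
this thread) and its "L'Ecuyer-CMRG" generator; `make_task_seeds` and
`run_task` are hypothetical helper names.

```r
library(parallel)  # provides nextRNGStream()

# Hypothetical helper: one L'Ecuyer-CMRG seed per *task*, derived from a
# single master seed.  Side effect: switches the session's RNG kind.
make_task_seeds <- function(n_tasks, master_seed = 1L) {
  set.seed(master_seed, kind = "L'Ecuyer-CMRG")
  s <- get(".Random.seed", envir = globalenv())
  seeds <- vector("list", n_tasks)
  for (i in seq_len(n_tasks)) {
    seeds[[i]] <- s
    s <- nextRNGStream(s)  # jump ahead to the start of the next stream
  }
  seeds
}

# Hypothetical task body: install the task's seed, then draw as usual.
run_task <- function(seed, n = 3L) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(n)
}

seeds <- make_task_seeds(60)

# The same tasks, processed in a different order, give the same draws.
res_a <- lapply(seeds, run_task)
res_b <- rev(lapply(rev(seeds), run_task))
stopifnot(identical(res_a, res_b))
```

Because the seed for task i depends only on i and the master seed, it
should not matter whether the tasks are spread over 10 or 60 jobs, as long
as each task receives its own element of `seeds`.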

Scenarios
A) Each job generates its own random numbers.
B) A specialized subset of jobs generates random numbers; the subset
expands as the total number of jobs expands.
C) A fixed subset, e.g., 5 jobs, is responsible for generating the
random numbers.

C) seems the most likely to be achievable.  rsprng, which we use,
initializes streams with a call that includes the stream number and the
total number of streams.  I don't know if the stream number alone
determines the sequence, or if the total number of streams, and possibly
messaging between processes, also comes into play.  Even if rsprng came
with some guarantees, other parallel generators (e.g., rlecuyer) might not.

There are also calls for spawning streams and manipulating stream state.
Perhaps these could be jury-rigged into a solution.
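
As an illustration of what such a jury-rigged scheme might look like for
scenario C, here is a sketch that keeps a fixed number of streams and gives
each task its own substream.  It does not use rsprng's spawning calls, and
whether those offer the same guarantee is not established here.  It assumes
the `parallel` package from later R versions (R >= 2.14.0); `n_streams` and
`task_seed` are hypothetical names.

```r
library(parallel)  # provides nextRNGStream() and nextRNGSubStream()

n_streams <- 5L  # fixed; does not change with the number of jobs

# Hypothetical helper: task i uses stream ((i - 1) %% n_streams) + 1 and,
# within that stream, substream (i - 1) %/% n_streams.
# Side effect: switches the session's RNG kind to "L'Ecuyer-CMRG".
task_seed <- function(i, master_seed = 1L) {
  set.seed(master_seed, kind = "L'Ecuyer-CMRG")
  s <- get(".Random.seed", envir = globalenv())
  for (k in seq_len((i - 1L) %% n_streams))  s <- nextRNGStream(s)
  for (k in seq_len((i - 1L) %/% n_streams)) s <- nextRNGSubStream(s)
  s
}

# Draw for task 7 (stream 2, second substream), wherever it happens to run.
assign(".Random.seed", task_seed(7L), envir = globalenv())
x <- runif(3)
```

The mapping from task to (stream, substream) is pure arithmetic on the task
index, so no messaging between processes is needed.  The trade-off is that
each task must stay within one substream's worth of draws.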

Any thoughts?

There's one point in the original post I'd like to correct.  See below.

On Thu, 2010-06-17 at 11:29 +0200, Martin Maechler wrote:
> Again, I've been asked about RNG (random number generation) in
> the context of - say embarrassingly - parallel computation.
> 
> I've heard of many people just using
> set.seed(i)  on the i-th computing node,
> and when asked for advice, I've typically even acknowledged,
> that I thought this to be safe, in the case of
>   { i in 1:N }  and  N < 1000 (say),
> but have always warned, that people have argued (I vaguely
> recall a talk given by Tony Rossini) that you need something
> else, namely with *proven* independent random streams:
> 
> And there are sophisticated approaches: the CRAN task view on
> HPC  mentions the R packages, 'rsprng' and 'rlecuyer'.
> The first interfaces to SPRNG 2.0, i.e., extra software which is
> outdated (SPRNG 4.0 is current, but not back-compatible), and
> does not compile without error {
>   gcc -c -O3 -DLittleEndian -DUSE_PMLCG  -DINTEL   -I../include  metropolis.c
>   metropolis.c:157:14: error: operator '!' has no right operand
> }
> Even if fixed, it's not trivial to install in a cluster
> environment without root privileges, and so I guess the package
> will currently be used by less than 1% of people doing
> cluster/parallel R computations.
> The second pkg, rlecuyer, has all functions "invisible", as
> their names start with ".lec." which also does not look like
> production-quality, at first look, at least.
> More importantly, the R functions used typically call
> rnorm(), runif(), r<...>, and may have internal C code that also
> interfaces to R's C API 'unif_rand()'.
> So a ``reasonable'' R package should really ensure that R's
> runif(), rnorm(), ... RNGs automatically use the package's
> alternative, by establishing itself {C level / R level}
> as R's RNGs; in R, see
>    help(Random.user)
> I'm pretty sure that neither package makes use of this R API
> feature.
My reading of the code is that rsprng, at least, does hook into the
underlying generators, which most of the R-level functions call.  It does not
hook into calls relating to setting the seed or saving and restoring the
stream state; i.e., those must be done through rsprng-specific calls.
It's probably not reasonable to expect that a single-stream seed setting
call would work with a parallel random number generator.

Ross

P.S. We're using rsprng because it is part of the base Debian
distribution.  Packaged versions of rlecuyer are available in other
repositories.
> 
> I'm asking for advice, mostly if you really have collected
> experience, and am hence including the package authors of
> 'rsprng' and 'rlecuyer' in the addressees of this e-mail.
> (IIRC, you all have to subscribe to R-SIG-HPC if you want your
>  replies to be  posted to the mailing list. 
>  So, for now, please use "reply to all" if possible)
> 
> And yes, let's think of a simple situation of an "embarrassingly
> parallel" application, e.g., pure simulation,
> or bootstrap (or cross-validation).
> 
> Thanks in advance,
> 
> Martin Maechler,
> ETH Zurich