[R-sig-hpc] parallel random numbers: set.seed(i), rsprng, rlecuyer, ??

Thu Jun 17 11:29:14 CEST 2010

Again, I've been asked about RNG (random number generation) in
the context of - say embarrassingly - parallel computation.

I've heard of many people just using
set.seed(i)  on the i-th computing node,
and when asked for advice, I've typically even acknowledged
that I thought this to be safe, in the case of
  { i in 1:N }  and  N < 1000 (say),
but have always warned that people have argued (I vaguely
recall a talk given by Tony Rossini) that you need something
else, namely *proven* independent random streams.
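
For concreteness, the set.seed(i) practice amounts to something like
the following minimal sketch. The function name sim_on_node() and the
toy workload (the mean of a few normal draws) are made up for
illustration; the nodes are emulated by a plain loop over i rather
than by a real cluster.

```r
# Hypothetical stand-in for the work done on the i-th computing node.
sim_on_node <- function(i, n = 5) {
  set.seed(i)        # the practice in question: seed = node index
  mean(rnorm(n))     # toy simulation result
}

N <- 4               # number of (emulated) nodes
res <- vapply(seq_len(N), sim_on_node, numeric(1))

# Re-running a node with the same index reproduces its result.
stopifnot(identical(sim_on_node(3), res[3]))
```

This gives reproducibility per node, but by itself says nothing about
whether the N streams are independent, which is exactly the concern
raised above.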

And there are sophisticated approaches: the CRAN task view on
HPC  mentions the R packages, 'rsprng' and 'rlecuyer'.
The first interfaces to SPRNG 2.0, i.e., extra software which is
outdated (SPRNG 4.0 is current, but not back-compatible), and
does not compile without error {
  gcc -c -O3 -DLittleEndian -DUSE_PMLCG  -DINTEL   -I../include  metropolis.c
  metropolis.c:157:14: error: operator '!' has no right operand
}
Even if fixed, it's not trivial to install in a cluster
environment without root privileges, and so I guess the package
will currently be used by less than 1% of people doing
cluster/parallel R computations.
The second pkg, rlecuyer, has all functions "invisible", as
their names start with ".lec.", which also does not look
production-quality, at first glance at least.
More importantly, the R functions used typically call
rnorm(), runif(), r<...>, and may have internal C code also
interfacing to R's C-API 'unif_rand()'; ideally, all of these
would draw from the parallel streams without modification.
So a ``reasonable'' R package should really ensure that R's
runif(), rnorm(), ... RNGs automatically use the package's
alternative, by establishing itself {C level / R level}
as R's RNGs; in R, see
   help(Random.user)
I'm pretty sure that neither package makes use of this R API
feature.
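
One rough way to check this from the R level is to look at what
RNGkind() reports after a package has been loaded and initialized.
The sketch below only inspects the current session; the variable
names are made up, and it assumes that a generator installed via the
Random.user mechanism is reported as "user-supplied".

```r
# Query the generators currently in use; the first element is the
# uniform generator that runif(), rnorm(), ... ultimately rely on.
kinds <- RNGkind()
uniform_kind <- kinds[1]

# TRUE only if something has established itself as R's uniform RNG
# through the user-supplied RNG interface (see help(Random.user)).
uses_user_rng <- identical(uniform_kind, "user-supplied")
print(uniform_kind)
print(uses_user_rng)
```

If uses_user_rng is FALSE after a package has set up its streams,
then plain calls to runif() or rnorm() are still served by R's
built-in generator, not by the package's.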

I'm asking for advice, mostly if you really have collected
experience, and am hence including the package authors of
'rsprng' and 'rlecuyer' in the addressees of this e-mail.
(IIRC, you all have to subscribe to R-SIG-HPC if you want your
 replies to be posted to the mailing list.
 So, for now, please use "reply to all" if possible.)

And yes, let's think of a simple situation of an "embarrassingly
parallel" application, e.g., pure simulation,
or bootstrap (or cross-validation).

Thanks in advance,

Martin Maechler,
ETH Zurich