[Rd] portable parallel seeds project: request for critiques

Wed Feb 22 19:17:25 CET 2012

Greetings. Answers below.

On Tue, Feb 21, 2012 at 7:04 AM, Petr Savicky <savicky at cs.cas.cz> wrote:

>
> Hi.
>
> In the description of your project in the file
>
>  http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README
>
> you argue as follows
>
>  Question: Why is this better than the simple old approach of
>  setting the seeds within each run with a formula like
>
>  set.seed(2345 + 10 * run)
>
>  Answer: That does allow replication, but it does not assure
>  that each run uses non-overlapping random number streams. It
>  offers absolutely no assurance whatsoever that the runs are
>  actually non-redundant.
>
> The following demonstrates that the function set.seed() for
> the default generator indeed allows to have correlated streams.
>
>  step <- function(x)
>  {
>      x[x < 0] <- x[x < 0] + 2^32
>      x <- (69069 * x + 1) %% 2^32
>      x[x > 2^31] <- x[x > 2^31] - 2^32
>      x
>  }
>
>  n <- 1000
>  seed1 <- 124370417 # any seed
>  seed2 <- step(seed1)
>
>  set.seed(seed1)
>  x <- runif(n)
>  set.seed(seed2)
>  y <- runif(n)
>
>  rbind(seed1, seed2)
>  table(x[-1] == y[-n])
>
> The output is
>
>             [,1]
>  seed1 124370417
>  seed2 205739774
>
>  FALSE  TRUE
>      5   994
>
> This means that if the streams x, y are generated from the two
> seeds above, then y is almost exactly equal to x shifted by 1.
>
> What is the current state of your project?
>
> Petr Savicky.

Thanks for the example. That is exactly the point I've been making
about the old, ordinary way of setting seeds.

I continue testing on my plan, I believe it does work. The example
implementation is successful.  I have not received any critiques that
make me think it is theoretically wrong, but I also have not received
feedback from any body who has very detailed knowledge of R's parallel
package to tell me that my approach is good.

In order for this to be easy for users, I need to put the init streams
and set current stream functions into a package, and then streamline
the process of creating the seed array.  My opinion is that CRAN is
now overflowed with too many "one function" packages, I don't want to
create another one just for these two little functions, and I may roll
it into my general purpose regression package called "rockchalk".

If I worked in a company and had money riding on this, I'd need to
hire someone from R Core to stare at the way I'm setting, saving, and
re-setting the .Random.seed variable to be completely sure there are
not complications.  Because parallel package is so new, there is
relatively little documentation on how it is supposed to work. I'm
just gazing at the source code and trying not to  break anything it
does, and trying to stop it from breaking anything I do.

One technical issue that has been raised to me is that R parallel's
implementation of the L'Ecuyer generator is based on integer valued
variables, whereas the original L'Ecuyer code uses double real
variables.  But I'm trusting the R Core on this, if they code the
generator in a way that is good enough for R itself, it is good enough
for me. (And if that's wrong, we'll all find out together :) ).

Josef Leydold (the rstream package author) has argued that R's
implementation runs more slowly than it ought to. We had some
correspondence and I tracked a few threads in forums. It appears the
approach suggested there is roadblocked by some characteristics deep
down in R and the way random streams are managed.  Packages have only
a limited, tenuous method to replace R's generators with their own
generators. Here is the comment  that L'Ecuyer and Leydold make in the
document that is distributed with rstream..  "rstream: Streams of
Random Numbers for Stochastic Simulation"

"There is one drawback in the implementation. R uses a static table to
implement access to different
kinds of random number generators. Only one slot
(RNGkind(kind="user-supplied")) has been provided
for users who want to add her own generator. This slot relies on a
pointer to a function
called user_unif_rand. Thus rstream has to abuse this slot to add
additional random number
generators. However, there are other packages on CRAN that use the
same trick (randaes, rlecuyer,
rsprng, SuppDists) or load one of these packages (Rmpi, rpvm, rpart,
snow, varSelRF). Thus if a
user loads two of these packages or adds her own uniform random number
generator, the behavior
of at least one of the packages is broken. This problem could be fixed
by some changes in the R
core system, either by adding a package variableto
RNGkind(kind="user-supplied") similar to the
.Call function, or by replacing the static table by a dynamic one.

I interpret all of this to mean that, although other approaches may
work as well or better,
the only approach that can "defend itself " in the ecology of a
running R program is an approach
that fiddles with R's own generators, rather than tries to replace
them. And that's what I'm doing.

pj

ps

Just this moment, while googling for Leydold's comments, I found a
post I had not seen before. This is neat.

Parallel Random Number Generation in C++ and R Using RngStream
Andrew Karl · Randy Eubank · Dennis Young
http://math.la.asu.edu/~eubank/webpage/rngStreamPaper.pdf

I think that faces the same problem L'Ecuyer and Leydold described.
The package may hang its generator on one pointer inside the R random
scheme, but there's no way to be sure it stays there.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas