[Rd] portable parallel seeds project: request for critiques

Fri Feb 17 22:23:21 CET 2012

Paul

I think (perhaps incorrectly) of the general problem being that one 
wants to run a random experiment, on a single node, or two nodes, or ten 
nodes, or any number of nodes, and reliably be able to reproduce the 
experiment without concern about how many nodes it runs on when you 
re-run it.

From your description I don't have the impression your solution would 
do that. Am I misunderstanding?

A second problem is that you want to use a proven algorithm for 
generating the numbers. This is implicitly solved by the above, because 
you always get the same result as you do on one node with a well-proven 
RNG. If you generate a string of seeds and then numbers from those, do 
you have a proven RNG?

Paul

On 12-02-17 03:57 PM, Paul Johnson wrote:
> I've got another edition of my simulation replication framework.  I'm
> attaching 2 R files and pasting in the readme.
>
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
>
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
>
>
> ## Paul E. Johnson CRMDA <pauljohn at ku.edu>
> ## Portable Parallel Seeds Project.
> ## 2012-02-18
>
> Portable Parallel Seeds Project
>
> This is how I'm going to recommend we work with random number seeds in
> simulations. It enhances work that requires runs with random numbers,
> whether runs are in a cluster computing environment or in a single
> workstation.
>
> It is a solution for two separate problems.
>
> Problem 1. I scripted up 1000 R runs and need high quality,
> unique, replicable random streams for each one. Each simulation
> runs separately, but I need to be confident their streams are
> not correlated or overlapping. For replication, I need to be able to
> select any run, say 667, and restart it exactly as it was.
>
> Problem 2. I've written a parallel MPI (Message Passing Interface)
> routine that launches 1000 runs and I need to ensure each has
> a unique, replicable random stream. I need to be able to
> select any run, say 667, and restart it exactly as it was.
>
> This project develops one approach to create replicable simulations.
> It blends ideas about seed management from John M. Chambers's
> Software for Data Analysis (2008) with ideas from the snowFT
> package by Hana Sevcikova and A. J. Rossini.
>
>
> Here's my proposal.
>
> 1. Run a preliminary program to generate an array of seeds
>
> run1:      seed1.1     seed1.2     seed1.3
> run2:      seed2.1     seed2.2     seed2.3
> run3:      seed3.1     seed3.2     seed3.3
> ...        ...         ...         ...
> run1000:   seed1000.1  seed1000.2  seed1000.3
>
> This example provides 3 separate streams of random numbers within each
> run. Because we will use the L'Ecuyer "many separate streams"
> approach, we are confident that there is no correlation or overlap
> between any of the runs.
>
> The projSeeds object has to have one row per run, but it is not a huge
> file. I created seeds for 2000 runs of a project that requires 2 seeds
> per run.  The saved file is 104,443 bytes, which is very small. By
> comparison, a 1400x1050 jpg image would usually be twice that size.
> If you save 10,000 runs' worth of seeds, the size rises to 521,993
> bytes, still pretty small.
>
> Because the seeds are saved in a file, we are sure each
> run can be replicated. We just have to teach each program
> how to use the seeds. That is step two.
>
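Step 1 could be sketched roughly as follows. This is only a guess at the shape of such a program, not the contents of seedCreator.R, which is not shown in the message. The name makeProjSeeds and its arguments are hypothetical, and the sketch assumes that nextRNGStream() from R's parallel package is what steps from one L'Ecuyer-CMRG stream to the next. Note that it changes the session's RNG kind and seed.

```r
library(parallel)

## Hypothetical helper (not from seedCreator.R): returns a list with one
## element per run; each element is a list of streamsPerRun seeds.
makeProjSeeds <- function(nRuns, streamsPerRun, seed = 234234) {
    RNGkind("L'Ecuyer-CMRG")   # side effect: changes the global RNG kind
    set.seed(seed)             # side effect: resets the global seed
    s <- get(".Random.seed", envir = .GlobalEnv)
    projSeeds <- vector("list", nRuns)
    for (run in seq_len(nRuns)) {
        projSeeds[[run]] <- vector("list", streamsPerRun)
        for (j in seq_len(streamsPerRun)) {
            projSeeds[[run]][[j]] <- s
            s <- nextRNGStream(s)   # jump to the start of the next stream
        }
    }
    projSeeds
}

## A small example: 4 runs with 3 streams each.
projSeeds <- makeProjSeeds(nRuns = 4, streamsPerRun = 3)
## To keep the seeds for later replication one would then do something like
## save(projSeeds, file = "projSeeds.rda")
```

Each stored seed is an ordinary L'Ecuyer-CMRG .Random.seed vector, so restarting a run only requires putting the right element back in place.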
>
> 2. Inside each run, an initialization function runs that loads the
> seeds file and takes the row of seeds that it needs.  As the
> simulation progresses, the user can ask for random numbers from the
> separate streams. When we need random draws from a particular stream,
> we set the variable "currentStream" with the function useStream().
>
> The function initSeedStreams creates several objects in
> the global environment. It sets the integer currentStream,
> as well as two list objects, startSeeds and currentSeeds.
> At the outset of the run, startSeeds and currentSeeds
> are the same thing. When we change the currentStream
> to a different stream, the currentSeeds vector is
> updated to remember where that stream was when we stopped
> drawing numbers from it.
>
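The bookkeeping described above might look something like the sketch below. It is not the code in controlledSeeds.R. The function and object names are taken from the description, but the bodies, the "seeds" argument and the tiny inline projSeeds are assumptions made for illustration.

```r
library(parallel)

## A tiny stand-in for projSeeds: 2 runs with 2 streams each (assumption).
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
s <- get(".Random.seed", envir = .GlobalEnv)
projSeeds <- vector("list", 2)
for (run in 1:2) {
    projSeeds[[run]] <- vector("list", 2)
    for (j in 1:2) {
        projSeeds[[run]][[j]] <- s
        s <- nextRNGStream(s)
    }
}

## Hypothetical version of initSeedStreams(): copy this run's seeds into
## the global environment and start drawing from stream 1.
initSeedStreams <- function(run, seeds = projSeeds) {
    RNGkind("L'Ecuyer-CMRG")
    assign("startSeeds", seeds[[run]], envir = .GlobalEnv)
    assign("currentSeeds", seeds[[run]], envir = .GlobalEnv)
    assign("currentStream", 1L, envir = .GlobalEnv)
    assign(".Random.seed", seeds[[run]][[1L]], envir = .GlobalEnv)
    invisible(NULL)
}

## Hypothetical version of useStream(): remember where the stream we are
## leaving has got to, then switch to stream n (optionally from its start).
useStream <- function(n, origin = FALSE) {
    cur <- get("currentSeeds", envir = .GlobalEnv)
    old <- get("currentStream", envir = .GlobalEnv)
    cur[[old]] <- get(".Random.seed", envir = .GlobalEnv)
    if (origin) cur[[n]] <- get("startSeeds", envir = .GlobalEnv)[[n]]
    assign("currentSeeds", cur, envir = .GlobalEnv)
    assign("currentStream", as.integer(n), envir = .GlobalEnv)
    assign(".Random.seed", cur[[n]], envir = .GlobalEnv)
    invisible(NULL)
}

initSeedStreams(2)
a1 <- runif(3)                 # first draws from stream 1
useStream(2)
b1 <- runif(3)                 # draws from stream 2
useStream(1)
a2 <- runif(3)                 # stream 1 picks up where it stopped
useStream(1, origin = TRUE)
a3 <- runif(6)                 # stream 1 replayed from its start
```

In this sketch a3 should reproduce a1 followed by a2, which is the behaviour the description asks for.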
>
> Now, for the proof of concept. A working example.
>
> Step 1. Create the Seeds. Review the R program
>
> seedCreator.R
>
> That creates the file "projSeeds.rda".
>
>
> Step 2. Use one row of seeds per run.
>
> Please review "controlledSeeds.R" to see an example usage
> that I've tested on a cluster.
>
> "controlledSeeds.R" can also be run on a single workstation for
> testing purposes.  There is a variable "runningInMPI" which determines
> whether the code is supposed to run on the Rmpi cluster or just on a
> single workstation.
>
>
> The code for each run of the model begins by loading the
> required libraries and loading the seed file, if it exists, or
> generating a new "projSeeds" object if it is not found.
>
> library(parallel)
> RNGkind("L'Ecuyer-CMRG")
> set.seed(234234)
> if (file.exists("projSeeds.rda")) {
>    load("projSeeds.rda")
> } else {
>    source("seedCreator.R")
> }
>
> ## Suppose the "run" number is:
> run <- 232
> initSeedStreams(run)
>
> After that, R's random generator functions will draw values
> from the first random stream that was initialized
> in projSeeds. When each repetition (run) occurs,
> R looks up the right seed for that run, and uses it.
>
> If the user wants to begin drawing observations from the
> second random stream, this command is used:
>
> useStream(2)
>
> If the user has drawn values from stream 1 already, but
> wishes to begin again at the initial point in that stream,
> use this command
>
> useStream(1, origin = TRUE)
>
>
> Question: Why is this approach better for parallel runs?
>
> Answer: After a batch of simulations, we can re-start any
> one of them and repeat it exactly. This builds on the idea
> of the snowFT package, by Hana Sevcikova and A.J. Rossini.
>
> That is different from the default approach of most R parallel
> designs, including R's own parallel, Rmpi and snow.
>
> The ordinary way of controlling seeds in R parallel would seed each of,
> say, 50 nodes once, and we would lose control over seeds because runs
> would be repeatedly assigned to nodes. The aim here is to make sure that
> each particular run has a known starting point. After a batch of
> 10,000 runs, we can look and say "something funny happened on run
> 1,323" and then we can bring that back to life later, easily.
>
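One way to check the property claimed here, and the one asked about at the top of this reply, is that a run's result should not depend on when or where it executes. The sketch below is an illustration only: oneRun and runSeeds are hypothetical names, and a reversed loop on one machine stands in for runs being handed to nodes in some arbitrary order.

```r
library(parallel)

## One seed per run, built with nextRNGStream() (assumed approach).
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
nRuns <- 6
runSeeds <- vector("list", nRuns)
s <- get(".Random.seed", envir = .GlobalEnv)
for (i in seq_len(nRuns)) {
    runSeeds[[i]] <- s
    s <- nextRNGStream(s)
}

## Hypothetical run function: install this run's own seed, then simulate.
oneRun <- function(run, seeds) {
    assign(".Random.seed", seeds[[run]], envir = .GlobalEnv)
    mean(rnorm(10))
}

## Same runs, executed in two different orders.
forward  <- sapply(1:nRuns, oneRun, seeds = runSeeds)
backward <- sapply(nRuns:1, oneRun, seeds = runSeeds)

## Re-running a single run later reproduces it exactly.
again4 <- oneRun(4, runSeeds)
```

Because each run installs its own seed before drawing, forward and rev(backward) should agree, and again4 should equal forward[4]. The same function could presumably be handed to parLapply() on a cluster, since the seed travels with the run rather than with the node.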
>
>
> Question: Why is this better than the simple old approach of
> setting the seeds within each run with a formula like
>
> set.seed(2345 + 10 * run)
>
> Answer: That does allow replication, but it does not assure
> that each run uses non-overlapping random number streams. It
> offers absolutely no assurance whatsoever that the runs are
> actually non-redundant.
>
> Nevertheless, it is a method that is widely used and recommended
> by some visible HOWTO guides.
>
>
>
> Citations
>
> Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant
>   Simple Network of Workstations. R package version 1.2-0.
>   http://CRAN.R-project.org/package=snowFT
>
> John M Chambers (2008). SoDA: Functions and Examples for "Software
>    for Data Analysis". R package version 1.0-3.
>
> John M Chambers (2008) Software for Data Analysis. Springer.
>