[Rd] portable parallel seeds project: request for critiques

Fri Feb 17 23:52:17 CET 2012

Ok, I guess I need to look more carefully.

Thanks,
Paul

On 12-02-17 04:44 PM, Paul Johnson wrote:
> On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
>> Paul
>>
>> I think (perhaps incorrectly) of the general problem being that one wants to
>> run a random experiment, on a single node, or two nodes, or ten nodes, or
>> any number of nodes, and reliably be able to reproduce the experiment
>> without concern about how many nodes it runs on when you re-run it.
>>
>> From your description I don't have the impression your solution would do
>> that. Am I misunderstanding?
>>
>
> Well, I think my approach does that!  Each time a function runs, it grabs
> a pre-specified set of seed values and initializes the R .Random.seed
> appropriately.
>
> Since I take the pre-specified seeds from the L'Ecuyer et al. approach
> (cite below), I believe that means each separate stream is dependably
> uncorrelated and non-overlapping, both within a particular run and
> across runs.
>
>> A second problem is that you want to use a proven algorithm for generating
>> the numbers. This is implicitly solved by the above, because you always get
>> the same result as you do on one node with a well proven RNG. If you
>> generate a string of seeds and then numbers from those, do you have a proven
>> RNG?
>>
> Luckily, I think that part was solved by people other than me:
>
> L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An
> object-oriented random-number package with many long streams and
> substreams. Operations Research 50 1073–5.
> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf
>
>
>
>> Paul
>>
>>
>> On 12-02-17 03:57 PM, Paul Johnson wrote:
>>>
>>> I've got another edition of my simulation replication framework.  I'm
>>> attaching 2 R files and pasting in the readme.
>>>
>>> I would especially like to know if I'm doing anything that breaks
>>> .Random.seed or other things that R's parallel uses in the
>>> environment.
>>>
>>> In case you don't want to wrestle with attachments, the same files are
>>> online in our SVN
>>>
>>>
>>> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
>>>
>>>
>>> ## Paul E. Johnson CRMDA <pauljohn at ku.edu>
>>> ## Portable Parallel Seeds Project.
>>> ## 2012-02-18
>>>
>>> Portable Parallel Seeds Project
>>>
>>> This is how I'm going to recommend we work with random number seeds in
>>> simulations. It enhances work that requires runs with random numbers,
>>> whether runs are in a cluster computing environment or in a single
>>> workstation.
>>>
>>> It is a solution for two separate problems.
>>>
>>> Problem 1. I scripted up 1000 R runs and need high quality,
>>> unique, replicable random streams for each one. Each simulation
>>> runs separately, but I need to be confident their streams are
>>> not correlated or overlapping. For replication, I need to be able to
>>> select any run, say 667, and restart it exactly as it was.
>>>
>>> Problem 2. I've written a Parallel MPI (Message Passing Interface)
>>> routine that launches 1000 runs and I need to assure each has
>>> a unique, replicable random stream. I need to be able to
>>> select any run, say 667, and restart it exactly as it was.
>>>
>>> This project develops one approach to create replicable simulations.
>>> It blends ideas about seed management from John M. Chambers's
>>> Software for Data Analysis (2008) with ideas from the snowFT
>>> package by Hana Sevcikova and A. J. Rossini.
>>>
>>>
>>> Here's my proposal.
>>>
>>> 1. Run a preliminary program to generate an array of seeds
>>>
>>> run1:     seed1.1     seed1.2     seed1.3
>>> run2:     seed2.1     seed2.2     seed2.3
>>> run3:     seed3.1     seed3.2     seed3.3
>>> ...       ...         ...         ...
>>> run1000:  seed1000.1  seed1000.2  seed1000.3
>>>
>>> This example provides 3 separate streams of random numbers within each
>>> run. Because we will use the L'Ecuyer "many separate streams"
>>> approach, we are confident that there is no correlation or overlap
>>> between any of the runs.
>>>
>>> The projSeeds object has to have one row per run, but it is not a huge
>>> file. I created seeds for 2000 runs of a project that requires 2 seeds
>>> per run.  The saved size of the file is 104,443 bytes, which is very
>>> small. By comparison, a 1400x1050 jpg image would usually be twice that
>>> size. If you save 10,000 runs' worth of seeds, the size rises to
>>> 521,993 bytes, still pretty small.
>>>
>>> Because the seeds are saved in a file, we are sure each
>>> run can be replicated. We just have to teach each program
>>> how to use the seeds. That is step two.
>>>
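A minimal sketch of how step 1 could be done with the parallel package that ships with R. This is not the contents of seedCreator.R: the helper name makeProjSeeds, its arguments, and the list-of-lists layout are assumptions made for illustration. It relies only on RNGkind(), set.seed() and nextRNGStream().

```r
library(parallel)

# Switch to the L'Ecuyer generator so that .Random.seed has the
# 7-integer layout that nextRNGStream() expects.
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

# Hypothetical helper (an assumption, not taken from seedCreator.R):
# returns a list with one element per run; each element is a list of
# `streamsPerRun` seeds, every seed being an integer vector of length 7.
makeProjSeeds <- function(nRuns, streamsPerRun, firstSeed) {
  seed <- firstSeed
  projSeeds <- vector("list", nRuns)
  for (run in seq_len(nRuns)) {
    projSeeds[[run]] <- vector("list", streamsPerRun)
    for (s in seq_len(streamsPerRun)) {
      projSeeds[[run]][[s]] <- seed
      seed <- nextRNGStream(seed)  # move on to the start of the next stream
    }
  }
  projSeeds
}

projSeeds <- makeProjSeeds(nRuns = 5, streamsPerRun = 3,
                           firstSeed = .Random.seed)

# To keep the seeds for later replication one could then call:
# save(projSeeds, file = "projSeeds.rda")
```

Each run's row is then projSeeds[[run]], and stream j of that run starts at projSeeds[[run]][[j]].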
>>>
>>> 2. Inside each run, an initialization function runs that loads the
>>> seeds file and takes the row of seeds that it needs.  As the
>>> simulation progresses, the user can ask for random numbers from the
>>> separate streams. When we need random draws from a particular stream,
>>> we set the variable "currentStream" with the function useStream().
>>>
>>> The function initSeedStreams creates several objects in
>>> the global environment. It sets the integer currentStream,
>>> as well as two list objects, startSeeds and currentSeeds.
>>> At the outset of the run, startSeeds and currentSeeds
>>> are the same thing. When we change the currentStream
>>> to a different stream, the currentSeeds list is
>>> updated to remember where that stream was when we stopped
>>> drawing numbers from it.
>>>
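The bookkeeping described above can be sketched as follows. The names initSeedStreams, useStream, currentStream, startSeeds and currentSeeds come from the description; the function bodies and the small stand-in projSeeds object are assumptions for illustration and may differ from what controlledSeeds.R actually does.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

# Stand-in for the saved seeds (assumed layout: a list of runs, each a
# list of length-7 integer seeds). Here: 3 runs with 2 streams each.
projSeeds <- vector("list", 3)
seed <- .Random.seed
for (run in 1:3) {
  projSeeds[[run]] <- vector("list", 2)
  for (s in 1:2) {
    projSeeds[[run]][[s]] <- seed
    seed <- nextRNGStream(seed)
  }
}

# Take the row of seeds for this run and start drawing from stream 1.
initSeedStreams <- function(run, seeds = projSeeds) {
  g <- globalenv()
  assign("startSeeds", seeds[[run]], envir = g)
  assign("currentSeeds", seeds[[run]], envir = g)
  assign("currentStream", 1L, envir = g)
  assign(".Random.seed", seeds[[run]][[1L]], envir = g)
  invisible(NULL)
}

# Switch to stream n; origin = TRUE rewinds that stream to its start.
useStream <- function(n, origin = FALSE) {
  g <- globalenv()
  saved <- get("currentSeeds", envir = g)
  # Remember where the stream we are leaving stopped.
  saved[[get("currentStream", envir = g)]] <- get(".Random.seed", envir = g)
  if (origin) saved[[n]] <- get("startSeeds", envir = g)[[n]]
  assign("currentSeeds", saved, envir = g)
  assign("currentStream", as.integer(n), envir = g)
  assign(".Random.seed", saved[[n]], envir = g)
  invisible(NULL)
}

initSeedStreams(2)
a  <- runif(3)                 # first draws from stream 1 of run 2
useStream(2)
b  <- runif(3)                 # draws from stream 2
useStream(1)
a2 <- runif(3)                 # stream 1 continues where it stopped
useStream(1, origin = TRUE)
a3 <- runif(3)                 # stream 1 again from its starting point
```

In this sketch a3 repeats a exactly, while a2 is a continuation of stream 1 rather than a repeat.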
>>>
>>> Now, for the proof of concept. A working example.
>>>
>>> Step 1. Create the Seeds. Review the R program
>>>
>>> seedCreator.R
>>>
>>> That creates the file "projSeeds.rda".
>>>
>>>
>>> Step 2. Use one row of seeds per run.
>>>
>>> Please review "controlledSeeds.R" to see an example usage
>>> that I've tested on a cluster.
>>>
>>> "controlledSeeds.R" can also be run on a single workstation for
>>> testing purposes.  There is a variable "runningInMPI" which determines
>>> whether the code is supposed to run on the Rmpi cluster or just in a
>>> single workstation.
>>>
>>>
>>> The code for each run of the model begins by loading the
>>> required libraries and loading the seed file, if it exists, or
>>> generating a new "projSeeds" object if it is not found.
>>>
>>> library(parallel)
>>> RNGkind("L'Ecuyer-CMRG")
>>> set.seed(234234)
>>> if (file.exists("projSeeds.rda")) {
>>>    load("projSeeds.rda")
>>> } else {
>>>    source("seedCreator.R")
>>> }
>>>
>>> ## Suppose the "run" number is:
>>> run <- 232
>>> initSeedStreams(run)
>>>
>>> After that, R's random generator functions will draw values
>>> from the first random stream that was initialized
>>> in projSeeds. When each repetition (run) occurs,
>>> R looks up the right seed for that run, and uses it.
>>>
>>> If the user wants to begin drawing observations from the
>>> second random stream, this command is used:
>>>
>>> useStream(2)
>>>
>>> If the user has drawn values from stream 1 already, but
>>> wishes to begin again at the initial point in that stream,
>>> use this command:
>>>
>>> useStream(1, origin = TRUE)
>>>
>>>
>>> Question: Why is this approach better for parallel runs?
>>>
>>> Answer: After a batch of simulations, we can re-start any
>>> one of them and repeat it exactly. This builds on the idea
>>> of the snowFT package, by Hana Sevcikova and A.J. Rossini.
>>>
>>> That is different from the default approach of most R parallel
>>> designs, including R's own parallel, Rmpi and snow.
>>>
>>> The ordinary way of controlling seeds in R parallel would initialize
>>> a seed on each of the (say) 50 nodes, and we would lose control over
>>> which seed a given run uses because each node is assigned one run
>>> after another. The aim here is to make sure that each particular run
>>> has a known starting point. After a batch of 10,000 runs, we can look
>>> and say "something funny happened on run 1,323" and then we can bring
>>> that back to life later, easily.
>>>
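A small illustration of the point, under stated assumptions: the names runSeeds and oneRun and the toy "simulation" (the mean of 100 normal draws) are hypothetical, and everything runs in a single R session rather than on a cluster. Because each run resets the generator from its own pre-assigned seed, the result of a run depends only on its run number, not on the order in which runs are executed or on which node would have executed them.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

# One pre-assigned L'Ecuyer stream per run (6 runs in this toy example).
nRuns <- 6
runSeeds <- vector("list", nRuns)
s <- .Random.seed
for (i in seq_len(nRuns)) {
  runSeeds[[i]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical run function: reset the generator to the run's own seed,
# then do the "simulation".
oneRun <- function(run) {
  assign(".Random.seed", runSeeds[[run]], envir = globalenv())
  mean(rnorm(100))
}

execOrder <- c(4, 1, 6, 2, 5, 3)
inOrder   <- sapply(seq_len(nRuns), oneRun)  # runs executed as 1, 2, ..., 6
shuffled  <- sapply(execOrder, oneRun)       # same runs, different order

# Bring a single run "back to life" later:
rerun4 <- oneRun(4)
```

Here shuffled holds the same per-run values as inOrder, just in the order given by execOrder, and rerun4 equals inOrder[4].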
>>>
>>>
>>> Question: Why is this better than the simple old approach of
>>> setting the seeds within each run with a formula like the following?
>>>
>>> set.seed(2345 + 10 * run)
>>>
>>> Answer: That does allow replication, but it offers no assurance
>>> that the runs use non-overlapping random number streams, or that
>>> the runs are actually non-redundant.
>>>
>>> Nevertheless, it is a method that is widely used and recommended
>>> by some visible HOWTO guides.
>>>
>>>
>>>
>>> Citations
>>>
>>> Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant
>>>   Simple Network of Workstations. R package version 1.2-0.
>>>   http://CRAN.R-project.org/package=snowFT
>>>
>>> John M Chambers (2008). SoDA: Functions and Examples for "Software
>>>    for Data Analysis". R package version 1.0-3.
>>>
>>> John M Chambers (2008) Software for Data Analysis. Springer.
>>>
>>>