[Rd] portable parallel seeds project: request for critiques
Petr Savicky
savicky at cs.cas.cz
Sat Feb 18 11:40:02 CET 2012
On Fri, Feb 17, 2012 at 09:33:33PM -0600, Paul Johnson wrote:
[...]
> The seed things I'm using are the 6 integer values from the L'Ecuyer.
> If you run the example script, the verbose option causes some to print
> out. The first 3 seeds in a saved project seeds file looks like:
>
> > projSeeds[[1]]
> [[1]]
> [1] 407 376488316 1939487821 1433925148 -1040698333 579503880
> [7] -624878918
>
> [[2]]
> [1] 407 -1107332181 854177397 1773099324 1774170776 -266687360
> [7] 816955059
>
> [[3]]
> [1] 407 936506900 -1924631332 -1380363206 2109234517 1585239833
> [7] -1559304513
>
> The 407 in the first position is an integer R uses to note the type of
> stream for which the seed is intended, in this case R'Lecuyer.
The paralel streams uses MRG32k3a generator by P.L'Ecuyer, whose original
implementation stores its internal state as double precision floating point
numbers. The vector .Random.seed consists of integers. Do you know, whether
a modified implementation of MRG32k3a is used, which works with integers
internally, or the conversion between double and integer is done whenever
the state is stored to .Random.seed?
> I don't have any formal proof that a "good quality hash function"
> would truly create seeds from which independent streams will be drawn.
One of the suggested ways how to generate, say, 2^16 cryptographically secure
random numbers, is to use a counter producing a sequence 1, 2, 3, ..., 2^16
and transform these values by a hash function. An example is Fortuna generator
http://en.wikipedia.org/wiki/Fortuna_(PRNG)
which is also available on CRAN as "randaes". The length of the sequence
is limited, since the sequence contains no collisions. If the sequence is
too long, this could allow to distinguish it from truly random numbers.
After generating 2^16 numbers, the seed is recomputed and another 2^16
numbers are generated.
Such a generator is a very good one. It is not used for simulations, since it
is slower (say by a factor of 5) than the linear generators like Mersenne
Twister, WELL or MRG family of generators. However, if it is used only for
initialization, then the speed is less important.
> There is, however, the proof in the L'Ecuyer paper that one can take
> the long stream and divide it into sections. That's the approach I'm
> taking here. Its the same approach the a parallel package in R
> follows, and parallel frameworks like snow.
According to
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf
the sectioning of the stream to substreams is done by jump ahead algorithm.
The new seed is far enough in the sequence from the previous seed, so it
is guaranteed that the sequence generated from the previous seed is shorter
than the jump and the subsequences do not overlap. However, the new seed
is computable as a function of the previous seed. If we use this strategy
to produce a matrix of seeds
s_{1,1}, s_{1,2}, ...
s_{2,1}, s_{2,2}, ...
s_{3,1}, s_{2,2}, ...
then each s_{i,j} may be computed from s_{1,1} and i, j. In this case, it is
sufficient to store s_{1,1}. If we know for each run the indices i,j, then
we can compute s_{i,j} by an efficient algorithm.
> The different thing in my approach is that I'm saving one row of seeds
> per simulation "run". So each run can be replicated exactly.
>
> I hope.
Saving .Random.seed should be a safe strategy.
Petr Savicky.
More information about the R-devel
mailing list