[Rd] portable parallel seeds project: request for critiques

Petr Savicky savicky at cs.cas.cz
Fri Mar 2 13:36:08 CET 2012

On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote:
> Hello,
> I would be also in favor for using multiple seeds based on (seed,
> task_number) for convenience (i.e. avoiding storing the seeds)
> and with the possibility of having a dynamic number of tasks, but I am mot
> sure it is theoretically correct.
> But I can refer you to this article:
> http://www.agner.org/random/ran-instructions.pdf , section 6.1
> where the author states:
> For example, if we make 100 streams of 10^10 random numbers each from an
> > SFMT
> > generator with cycle length ρ = 2^11213, we have a probability of overlap
> > p ≈ 10^3362.
> >
> What do you think ? I am very concerned by the correctness of this approach
> so would appreciate any advice on that matter.


First, let us consider a theoretical scenario, where the starting
points are chosen independently and from the uniform distribution.
Then the above should be OK.

SFMT supports several different periods, one of them is 2^11213-1.
If we choose 100 random starting points in the cycle of this
length and consider intervals of length 10^10 each, then two
of them overlap with probability

  (2 * 10^10 - 1) / (2^11213 - 1)

which is approx 10^-3365. An upper bound on the probability that
any of the 100 overlap is

  choose(100, 2) * (2 * 10^10 - 1) / (2^11213 - 1)

which is still negligible and approx 10^-3362 as you say except
of a typo in the sign.

The crucial assumption is that we choose the starting points
from the uniform distribution over the period. Every point
on the period cycle is represented by a different internal
state of the generator. So, uniform distribution on the period
is equivalent to the uniform distribution on admissible states. 
An admissible state is represented by 11213 bits, which are
not all 0. The probability of all 0 is extremely small. So, 
in order to generate one such state, it is sufficient to fill
the state array with uniform i.i.d. bits. This is suitable
only as a theoretical model, since this can guarantee
reproducibility only if we store the whole state.

In order to guarantee reproducibility with simpler means, we
cannot fill the initial state with uniform i.i.d. random bits.
So, we do not have a uniform distribution over the period, but we
can have different approximations to it. A cryptographically
strong hash function is such an approximation in the sense
of computational indistinguishability. It does not and also
cannot approximate the uniform distribution in the statistical
sense, since the set of possible outputs is much smaller than
the period. However, if someone gives us two initial states,
one from the hash function and the other created from uniform
i.i.d. bits, then there is no known computational method to
distinguish these two in moderate CPU time. So, you can safely
use the output of the hash function for simulations.

In fact, the requirements of simulations are weaker. For example,
MD5 cannot be used for cryptography any more, since there are known
algorithms to break it. However, if you use it for a simulation,
then the simulation will be biased only if it contains an algorithm,
which breaks MD5. The probability that this happens just by chance
is small.

Petr Savicky.

More information about the R-devel mailing list