[Bioc-devel] reproducible with mclapply?

Ramon Diaz-Uriarte rdiaz02 at gmail.com
Fri Jun 5 11:36:52 CEST 2015




On Thu, 04-06-2015, at 15:50, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
> Note you're not guaranteed that two random streams starting with different
> seeds will be (approximately) independent, so the suggestion on SO makes
> the numbers reproducible but technically wrong.
>

But whether or not that is a problem can depend strongly on what is done
inside each of the mclapply runs (i.e., what the FUN does) and what will be
done with the output from FUN.


Suppose you want to run FUN 10 times. You might do, among others, one of
the following:


1. mclapply of FUN, properly using a parallel version of a random number
generator, as documented in the "parallel" vignette.


2. Run FUN 10 times, one after the other, from the same R session (e.g.,
via lapply, repeat, or a for loop). Before the first run (and only before
the first) you either call set.seed(seed = something) or let R set the seed
automatically (equivalent to set.seed(seed = NULL)).


Both 1. and 2. should give you independent random streams for all 10 runs.


3. Run FUN once in each of 10 R sessions that are started from the shell
(more or less simultaneously). Each R session gets the seed as per the
default R mechanism (see "Note" in the help of set.seed) or, equivalently,
each R session calls set.seed(seed = NULL).


4. In the same R session, run FUN 10 times, doing set.seed(seed = NULL)
immediately before each run.


With both 3. and 4. each run will have a different seed. However, you are
affected by the caveat that "you're not guaranteed that two random streams
starting with different seeds will be (approximately) independent".
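
A minimal sketch of the difference between 2. and 4., assuming a
hypothetical FUN that simply draws three uniform numbers (FUN, res2 and res4
are illustrative names, not from the thread):

```r
# Hypothetical FUN: stands in for any function that consumes random numbers.
FUN <- function(i) runif(3)

# Option 2: seed once, before the first run only; the 10 runs then use
# consecutive, non-overlapping parts of a single random stream.
set.seed(42)
res2 <- lapply(1:10, FUN)

# Option 4: re-initialise the seed before every run, as if no seed had been
# set yet; each run starts from its own seed and is not reproducible.
res4 <- lapply(1:10, function(i) {
  set.seed(seed = NULL)
  FUN(i)
})
```

Re-running the option 2 part with the same set.seed(42) reproduces res2,
whereas res4 will differ from call to call.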



But doing things as in 3. or 4. might not be a problem for many FUNs and
many uses of the output of FUN (it might be for the original post, though).
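
For the original post, where the result should not depend on mc.cores, one
possible sketch in the spirit of 1. gives every task its own L'Ecuyer-CMRG
stream via nextRNGStream() from the parallel package. The names n_tasks,
streams, FUN and run are hypothetical, and FUN assigns .Random.seed in the
global environment, which is an assumption about what is acceptable in your
setting:

```r
library(parallel)

# Switch to the generator that supports multiple streams, then seed once.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

# Pre-compute one stream seed per task (not per core).
n_tasks <- 10
streams <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  streams[[i]] <- s
  s <- nextRNGStream(s)   # seed of the next stream
}

# Hypothetical FUN: switch to the i-th stream, then draw a permutation.
FUN <- function(i) {
  assign(".Random.seed", streams[[i]], envir = .GlobalEnv)
  sample(1:10)
}

# mc.cores = 1 works on every platform; values above 1 need a Unix-alike.
run <- function(cores) mclapply(seq_len(n_tasks), FUN, mc.cores = cores)
res1 <- run(1)
```

On a Unix-alike, run(2) and run(4) should return the same list as run(1),
because each task depends only on its own stream and not on which child
process handles it.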



That said, I fully agree that using set.seed inside the function does not
seem like a good idea.
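
One way to keep set.seed out of the function, following the suggestion
quoted below of generating the permutation indices before calling mclapply,
could look like this. The data x, the grouping g and the statistic perm_stat
are hypothetical placeholders:

```r
library(parallel)

# Hypothetical data and grouping for a two-group permutation test.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 4.2, 3.9, 4.4, 4.1, 4.6)
g <- rep(c(TRUE, FALSE), each = 5)

# Hypothetical statistic: difference in group means after permuting x.
perm_stat <- function(idx) mean(x[idx][g]) - mean(x[idx][!g])

# All random numbers are drawn here, in the main process, under a seed the
# user controls; the parallel part below is deterministic.
set.seed(0)
perms <- replicate(1000, sample.int(length(x)), simplify = FALSE)

# mc.cores = 1 runs everywhere; on a Unix-alike a larger value can be used
# without changing the result.
stats <- unlist(mclapply(perms, perm_stat, mc.cores = 1))
```

Since no random numbers are drawn inside perm_stat, the result does not
depend on the number of cores or on how the tasks are scheduled.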



Best,


R.







> If you want true independence you either need to use a parallel version of
> the random number generator or you do what I suggested.  Because of how
> mclapply works (via fork) it is not clear to me that it is possible to use
> a parallel version of the random number generator, but I am not sure about
> this.  The snippet from the documentation quoted above suggests I am wrong.
>
> Best,
> Kasper
>
> On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk <petyuk at gmail.com> wrote:
>
>> There are different ways set.seed can be used.  The way suggested in the
>> aforementioned Stack Overflow post is basically a two-stage process.
>> First, a seed is provided by the user (set.seed(1)); that is, the user can
>> change the outcome from run to run.  Based on that seed, a vector of
>> randomized seeds is generated (seeds <- sample.int(length(input),
>> replace=TRUE)).  Those seeds are basically arguments to the function under
>> mclapply/lapply that control random number generation for each
>> iteration (set.seed(seeds[idx])).
>> So set.seed plays two different roles.  The first lets the user control
>> random number generation, and the second (within the function) makes sure
>> that it is the same for individual iterations regardless of how the loop
>> is executed.
>> Does that make sense?
>>
>> On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang <gcyu at connect.hku.hk>
>> wrote:
>>
>>> There is one possible solution posted in
>>>
>>> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
>>> .
>>>
>>> As Kasper suggested, using set.seed inside a package is not proper
>>> practice.
>>>
>>> I suggest using a parameter, for example seed=FALSE, to disable set.seed
>>> by default; if the user wants reproducible results, e.g. for a
>>> demonstration, they set seed=TRUE explicitly and set.seed will be run
>>> inside the function.
>>>
>>> Bests,
>>> Guangchuang
>>>
>>> On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>> > For this situation, generate the permutation indexes outside of the
>>> > mclapply, and then do mclapply over a list with the indices.
>>> >
>>> > And btw., please don't use set.seed inside a package; that control
>>> > should be left completely to the user.
>>> >
>>> > Best,
>>> > Kasper
>>> >
>>> > On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey <
>>> stvjc at channing.harvard.edu>
>>> > wrote:
>>> >
>>> >> This document indicates how to achieve reproducibility independent of
>>> >> the underlying physical environment.
>>> >>
>>> >> http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>>> >>
>>> >> Let me know if that satisfies the question.
>>> >>
>>> >> On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>> >> wrote:
>>> >>
>>> >> > Dear Vincent,
>>> >> >
>>> >> > RNGkind("L'Ecuyer-CMRG") works the same way as using mc.set.seed=FALSE.
>>> >> >
>>> >> > When mc.cores changes, the output is not reproducible.
>>> >> >
>>> >> > I think this issue is also of concern within the Bioconductor
>>> >> > community, as parallel versions of permutation tests are commonly
>>> >> > used now.
>>> >> >
>>> >> > Best Regards,
>>> >> >
>>> >> > Guangchuang
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey <
>>> >> stvjc at channing.harvard.edu>
>>> >> > wrote:
>>> >> >
>>> >> >> Hi, this question belongs on R-help, but perhaps
>>> >> >>
>>> >> >> https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html
>>> >> >>
>>> >> >> will be useful.
>>> >> >>
>>> >> >> Best regards
>>> >> >>
>>> >> >> On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang <
>>> gcyu at connect.hku.hk>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> Dear all,
>>> >> >>>
>>> >> >>> I have an issue with setting the seed value when using the parallel package.
>>> >> >>>
>>> >> >>> > library("parallel")
>>> >> >>> > library("digest")
>>> >> >>> >
>>> >> >>> > set.seed(0)
>>> >> >>> > m <- mclapply(1:10, function(x) sample(1:10),
>>> >> >>> +               mc.cores=2)
>>> >> >>> > digest(m, 'crc32')
>>> >> >>> [1] "4827c80c"
>>> >> >>> >
>>> >> >>> > set.seed(0)
>>> >> >>> > m <- mclapply(1:10, function(x) sample(1:10),
>>> >> >>> +               mc.cores=2)
>>> >> >>> > digest(m, 'crc32')
>>> >> >>> [1] "e95b9134"
>>> >> >>>
>>> >> >>> By default, set.seed() will be ignored since mclapply will set the
>>> >> >>> seed internally.
>>> >> >>>
>>> >> >>> If we use mc.set.seed=FALSE to disable this feature, it works as
>>> >> >>> indicated below:
>>> >> >>>
>>> >> >>> > set.seed(0)
>>> >> >>> > m <- mclapply(1:10, function(x) sample(1:10),
>>> >> >>> +               mc.cores=2, mc.set.seed = FALSE)
>>> >> >>> > digest(m, 'crc32')
>>> >> >>> [1] "6bbada78"
>>> >> >>> >
>>> >> >>> > set.seed(0)
>>> >> >>> > m <- mclapply(1:10, function(x) sample(1:10),
>>> >> >>> +               mc.cores=2, mc.set.seed = FALSE)
>>> >> >>> > digest(m, 'crc32')
>>> >> >>> [1] "6bbada78"
>>> >> >>>
>>> >> >>> The problem is that the results also depend on the number of
>>> >> >>> cores.
>>> >> >>>
>>> >> >>> > set.seed(0)
>>> >> >>> > m <- mclapply(1:10, function(x) sample(1:10),
>>> >> >>> +               mc.cores=4, mc.set.seed = FALSE)
>>> >> >>> > digest(m, 'crc32')
>>> >> >>> [1] "a22e0aab"
>>> >> >>>
>>> >> >>>
>>> >> >>> Any idea?
>>> >> >>>
>>> >> >>> Best Regards,
>>> >> >>> Guangchuang
>>> >> >>> --
>>> >> >>> --~--~---------~--~----~------------~-------~--~----~
>>> >> >>> Guangchuang Yu, PhD Candidate
>>> >> >>> State Key Laboratory of Emerging Infectious Diseases
>>> >> >>> School of Public Health
>>> >> >>> The University of Hong Kong
>>> >> >>> Hong Kong SAR, China
>>> >> >>> www: http://ygc.name
>>> >> >>> -~----------~----~----~----~------~----~------~--~---
>>> >> >>>
>>> >> >>>         [[alternative HTML version deleted]]
>>> >> >>>
>>> >> >>> _______________________________________________
>>> >> >>> Bioc-devel at r-project.org mailing list
>>> >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>>
>>
>>

-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz


