[Bioc-devel] reproducible with mclapply?

Vladislav Petyuk petyuk at gmail.com
Thu Jun 4 19:20:13 CEST 2015


The only drawback I see so far in using set.seed inside the function is
that it interferes with a seed previously set by the user, so any follow-up
stochastic computation will be out of the user's control.  Perhaps there are
other undesirable effects that I do not see at this point.
I tweaked the solution a bit here; it wraps mclapply/lapply and maintains
the user's control of stochasticity by resetting the seed to a random
value generated from the user's input.
http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
I tend to agree, though, that in the long run doRNG is the way to go.
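
A minimal sketch of that kind of wrapper, assuming a Unix-alike system where mclapply can fork.  The name seeded_mclapply and its arguments are hypothetical, and this is not the code from the Stack Overflow answer.  It draws one seed per element plus one "exit" seed from the user's RNG stream, so the caller's set.seed still determines both the results and whatever random numbers come afterwards.  Streams started from different seeds are not guaranteed to be independent, so this only addresses reproducibility.

```r
library(parallel)

# Hypothetical wrapper, for illustration only.
seeded_mclapply <- function(X, FUN, ..., mc.cores = 1L) {
  # One seed per element plus one exit seed, all drawn from the caller's RNG.
  seeds <- sample.int(.Machine$integer.max, length(X) + 1L)
  # Leave the RNG in a state that depends only on the caller's seed,
  # whether the work ran in this process or in forked children.
  on.exit(set.seed(seeds[length(seeds)]))
  mclapply(seq_along(X), function(i) {
    set.seed(seeds[i])   # iteration i always starts from the same state
    FUN(X[[i]], ...)
  }, mc.cores = mc.cores)
}

set.seed(1)
a <- seeded_mclapply(1:10, function(x) sample(1:10), mc.cores = 1)
after_a <- runif(1)

set.seed(1)
b <- seeded_mclapply(1:10, function(x) sample(1:10), mc.cores = 2)
after_b <- runif(1)

identical(a, b)              # same results for 1 and 2 cores
identical(after_a, after_b)  # follow-up draws also agree
```

Changing the value passed to set.seed at the top changes every result, so control stays with the user.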

On Thu, Jun 4, 2015 at 8:15 AM, Valerie Obenchain <vobencha at fredhutch.org>
wrote:

> I'll add a section to the BiocParallel docs.
>
> Valerie
>
> On 06/04/2015 07:55 AM, Kasper Daniel Hansen wrote:
>
>> Yes, based on the documentation that particular random stream generator
>> would work with mclapply.
>>
>> This is absolutely a subject which ought to be covered in the BiocParallel
>> documentation.
>>
>> And commenting on another set of recommendations: please NEVER use
>> set.seed inside a function.  Unfortunately, because of the way R works,
>> this is a really bad idea.  So are functions with arguments like
>> (set.seed = FALSE).  Users need to be educated about this.  The main issue
>> with using set.seed arises when your work is wrapped into other people's
>> code, for example with an external bootstrap or similar.  I understand the
>> desire for reproducibility, but the design of the random generator in R is
>> such that this should really be left to the user.
>>
>> Kasper
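
To make the external bootstrap problem concrete, here is a small sketch.  boot_mean is a hypothetical function standing in for package code that seeds itself, and the data are made up.  When an outer loop calls it repeatedly, every replicate restarts from the same RNG state and the bootstrap collapses to a single value.

```r
# Hypothetical package-style function with a seeding switch (illustration only).
boot_mean <- function(x, seed = FALSE) {
  if (seed) set.seed(123)          # silently overwrites the caller's RNG state
  mean(sample(x, replace = TRUE))  # one bootstrap resample
}

x <- c(4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 6.1, 5.0)  # toy data

set.seed(1)  # the user's own seed
boot_seeded <- replicate(200, boot_mean(x, seed = TRUE))
boot_free   <- replicate(200, boot_mean(x))

length(unique(boot_seeded))  # 1: every replicate is the same resample
length(unique(boot_free))    # many distinct values, as a bootstrap needs
```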
>>
>> On Thu, Jun 4, 2015 at 10:39 AM, Vincent Carey <
>> stvjc at channing.harvard.edu>
>>
>> wrote:
>>
>>> It does appear to me that the doRNG vignette sec 1.1 describes a solution
>>> to the problem posed.  It is less clear to me that this method is readily
>>> adopted with BiocParallel unless registerDoPar is in use....  Should we
>>> address this topic explicitly in the vignette?
>>>
>>> On Thu, Jun 4, 2015 at 9:50 AM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> Note you're not guaranteed that two random streams starting with
>>>> different seeds will be (approximately) independent, so the suggestion
>>>> on SO makes the numbers reproducible but technically wrong.
>>>>
>>>> If you want true independence you either need to use a parallel version
>>>> of the random number generator or you do what I suggested.  Because of
>>>> how mclapply works (via fork) it is not clear to me that it is possible
>>>> to use a parallel version of the random number generator, but I am not
>>>> sure about this.  The snippet from the documentation quoted above
>>>> suggests I am wrong.
>>>>
>>>> Best,
>>>> Kasper
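
One way to combine a parallel random number generator with mclapply is to give every task its own L'Ecuyer-CMRG stream using nextRNGStream from the parallel package.  The sketch below assumes a Unix-alike system; make_streams and run_tasks are hypothetical helper names.  Because the stream is tied to the task rather than to the worker, the number of cores should not affect the output.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # note: changes the RNG kind for the whole session

# Build n stream seeds by stepping forward from the current RNG state.
make_streams <- function(n) {
  s <- get(".Random.seed", envir = globalenv())
  streams <- vector("list", n)
  for (i in seq_len(n)) {
    streams[[i]] <- s
    s <- nextRNGStream(s)  # jump to the start of the next stream
  }
  streams
}

# Hypothetical helper: task i always uses stream i, whichever worker runs it.
run_tasks <- function(n, cores) {
  streams <- make_streams(n)
  mclapply(streams, function(s) {
    assign(".Random.seed", s, envir = globalenv())
    sample(1:10)
  }, mc.cores = cores, mc.set.seed = FALSE)
}

set.seed(0); r2 <- run_tasks(10, cores = 2)
set.seed(0); r4 <- run_tasks(10, cores = 4)
identical(r2, r4)
```

The seed is still chosen by the user at top level; the helpers only derive streams from whatever state they find.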
>>>>
>>>> On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk <petyuk at gmail.com>
>>>> wrote:
>>>>
>>>>> There are different ways set.seed can be used.  The way it is suggested
>>>>> in the aforementioned stackoverflow post is basically a two-stage process.
>>>>> First, a seed is provided by the user (set.seed(1)), so the user can
>>>>> change the outcome from run to run.  Based on that seed, a vector of
>>>>> randomized seeds is generated
>>>>> (seeds <- sample.int(length(input), replace=TRUE)).
>>>>> Those seeds are basically arguments to the function under mclapply/lapply
>>>>> that help to control random number generation for each iteration
>>>>> (set.seed(seeds[idx])).
>>>>> There are two different roles of set.seed.  The first lets the user
>>>>> control random number generation, and the second (within the function)
>>>>> makes sure that it is the same for individual iterations regardless of
>>>>> how the loop is executed.
>>>>> Does that make sense?
>>>>>
>>>>> On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> There is one possible solution posted at
>>>>>> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
>>>>>>
>>>>>> As Kasper suggested, it's not proper to use set.seed inside a
>>>>>> package.
>>>>>>
>>>>>> I suggest using a parameter, for example seed=FALSE, to disable the
>>>>>> set.seed call; if the user wants the result to be reproducible, e.g. in
>>>>>> a demonstration, they can set seed=TRUE explicitly and set.seed will be
>>>>>> run inside the function.
>>>>>>
>>>>>> Bests,
>>>>>> Guangchuang
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>
>>>>>>> For this situation, generate the permutation indexes outside of the
>>>>>>> mclapply, and then do mclapply over a list with the indices.
>>>>>>>
>>>>>>> And btw., please don't use set.seed inside a package; that control
>>>>>>> should be left completely to the user.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kasper
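
A short sketch of that suggestion, assuming a Unix-alike system.  The data, the grouping and the perm_stat function are made up for illustration.  All random draws happen once in the parent process, and the workers only apply precomputed indices.

```r
library(parallel)

# Toy data: two groups of five observations (made up for illustration).
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 4.2, 3.9, 4.4, 4.8, 4.1)
g <- rep(c("a", "b"), each = 5)

# Difference in group means after reordering x by a permutation index.
perm_stat <- function(idx) {
  xp <- x[idx]
  mean(xp[g == "a"]) - mean(xp[g == "b"])
}

set.seed(0)  # chosen by the user at top level, not inside a package function
perms <- replicate(200, sample(length(x)), simplify = FALSE)

# Workers draw no random numbers, so the core count cannot change the result.
res2 <- unlist(mclapply(perms, perm_stat, mc.cores = 2))
res4 <- unlist(mclapply(perms, perm_stat, mc.cores = 4))
identical(res2, res4)
```

The trade-off is memory: all permutation indices are held in the parent before any work starts.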
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey <
>>>>>>>
>>>>>> stvjc at channing.harvard.edu>
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This document indicates how to achieve reproducibility independent of
>>>>>>>> the underlying physical environment.
>>>>>>>>
>>>>>>>> http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>>>>>>>>
>>>>>>>> Let me know if that satisfies the question.
>>>>>>>>
>>>>>>>> On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang <
>>>>>>>>
>>>>>>> gcyu at connect.hku.hk>
>>>>
>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear Vincent,
>>>>>>>>>
>>>>>>>>> RNGkind("L'Ecuyer-CMRG") works the same as using mc.set.seed=FALSE.
>>>>>>>>>
>>>>>>>>> When mc.cores changes, the output is not reproducible.
>>>>>>>>>
>>>>>>>>> I think this issue is also of concern within the Bioconductor
>>>>>>>>> community, as parallel versions of permutation tests are commonly
>>>>>>>>> used now.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> Guangchuang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey <
>>>>>>>>>
>>>>>>>> stvjc at channing.harvard.edu>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, this question belongs on R-help, but perhaps
>>>>>>>>>> https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html
>>>>>>>>>> will be useful.
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang <
>>>>>>>>>>
>>>>>>>>> gcyu at connect.hku.hk>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> I have an issue with setting the seed value when using the parallel
>>>>>>>>>>> package.
>>>>>>>>>>>
>>>>>>>>>>> > library("parallel")
>>>>>>>>>>> > library("digest")
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "4827c80c"
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "e95b9134"
>>>>>>>>>>>
>>>>>>>>>>> By default, set.seed() will be ignored since mclapply will set the
>>>>>>>>>>> seed internally.
>>>>>>>>>>>
>>>>>>>>>>> If we use mc.set.seed=FALSE to disable this feature, it works as
>>>>>>>>>>> indicated below:
>>>>>>>>>>>
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>>
>>>>>>>>>>> The problem is that the results also depend on the number of
>>>>>>>>>>> cores.
>>>>>>>>>>>
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> +               mc.cores=4, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "a22e0aab"
>>>>>>>>>>>
>>>>>>>>>>> Any idea?
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Guangchuang
>>>>>>>>>>> --
>>>>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>>>>>> School of Public Health
>>>>>>>>>>> The University of Hong Kong
>>>>>>>>>>> Hong Kong SAR, China
>>>>>>>>>>> www: http://ygc.name
>>>>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>>>>>
>>>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, Seattle, WA 98109
>
> Email: vobencha at fredhutch.org
> Phone: (206) 667-3158
>
>
