[Bioc-devel] reproducible with mclapply?

Valerie Obenchain vobencha at fredhutch.org
Thu Jun 4 17:15:56 CEST 2015


I'll add a section to the BiocParallel docs.

Valerie

On 06/04/2015 07:55 AM, Kasper Daniel Hansen wrote:
> Yes, based on the documentation that particular random stream generator
> would work with mclapply.
>
> This is absolutely a subject which ought to be covered in the BiocParallel
> documentation.
>
> And commenting on another set of recommendations: please NEVER use
> set.seed inside a function.  Unfortunately, because of the way R works,
> this is a really bad idea, as are functions with arguments like (set.seed =
> FALSE).  Users need to be educated about this.  The main issue with using
> set.seed is when your work is wrapped into other people's code, for example
> with an external bootstrap or similar.  I understand the desire for
> reproducibility, but the design of the random generator in R is such that
> this should really be left to the user.
>
> Kasper
>
> On Thu, Jun 4, 2015 at 10:39 AM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>> It does appear to me that the doRNG vignette sec 1.1 describes a solution
>> to the problem posed.  It is less clear to me that this method is readily
>> adopted with BiocParallel unless registerDoPar is in use....  Should we
>> address this topic explicitly in the vignette?
>>
>> On Thu, Jun 4, 2015 at 9:50 AM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>
>>> Note you're not guaranteed that two random streams starting with different
>>> seeds will be (approximately) independent, so the suggestion on SO makes
>>> the numbers reproducible but technically wrong.
>>>
>>> If you want true independence you either need to use a parallel version of
>>> the random number generator or you do what I suggested.  Because of how
>>> mclapply works (via fork) it is not clear to me that it is possible to use
>>> a parallel version of the random number generator, but I am not sure about
>>> this.  The snippet from the documentation quoted above suggests I am
>>> wrong.
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk <petyuk at gmail.com>
>>> wrote:
>>>
>>>> There are different ways set.seed can be used.  The way suggested in the
>>>> aforementioned stackoverflow post is basically a two-stage process.
>>>> First, a seed is provided by the user (set.seed(1)).  That is, the user
>>>> can change the outcome from run to run.  Based on that seed, a vector of
>>>> randomized seeds is generated
>>>> (seeds <- sample.int(length(input), replace=TRUE)).  Those seeds are
>>>> basically arguments to the function under mclapply/lapply that help to
>>>> control random number generation for each iteration
>>>> (set.seed(seeds[idx])).
>>>> set.seed thus has two different roles.  The first lets the user control
>>>> random number generation, and the second (within the function) makes
>>>> sure that it is the same for individual iterations regardless of how the
>>>> loop is executed.
>>>> Does that make sense?
>>>>
>>>> On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>> wrote:
>>>>
>>>>> There is one possible solution posted in
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
>>>>> .
>>>>>
>>>>> As Kasper suggested, it's not proper to use set.seed inside a
>>>>> package.
>>>>>
>>>>> I suggest using a parameter, for example seed=FALSE, to disable
>>>>> set.seed; if the user wants reproducible results, e.g. in a
>>>>> demonstration, they can set seed=TRUE explicitly and set.seed will be
>>>>> run inside the function.
>>>>>
>>>>> Bests,
>>>>> Guangchuang
>>>>>
>>>>> On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>> For this situation, generate the permutation indexes outside of the
>>>>>> mclapply, and then do mclapply over a list with the indices.
>>>>>>
>>>>>> And btw., please don't use set.seed inside a package; that control
>>>>>> should be left completely to the user.
>>>>>>
>>>>>> Best,
>>>>>> Kasper
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey
>>>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>>
>>>>>>> This document indicates how to achieve reproducibility independent of
>>>>>>> the underlying physical environment.
>>>>>>>
>>>>>>> http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>>>>>>>
>>>>>>> Let me know if that satisfies the question.
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang
>>>>>>> <gcyu at connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Dear Vincent,
>>>>>>>>
>>>>>>>> RNGkind("L'Ecuyer-CMRG") works the same as using mc.set.seed=FALSE.
>>>>>>>>
>>>>>>>> When mc.cores changes, the output is not reproducible.
>>>>>>>>
>>>>>>>> I think this issue is also of concern within the Bioconductor
>>>>>>>> community, as parallel versions of permutation tests are commonly
>>>>>>>> used now.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Guangchuang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey
>>>>>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi, this question belongs on R-help, but perhaps
>>>>>>>>>
>>>>>>>>> https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html
>>>>>>>>>
>>>>>>>>> will be useful.
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang
>>>>>>>>> <gcyu at connect.hku.hk> wrote:
>>>>>>>>>
>>>>>>>>>> Dear all,
>>>>>>>>>>
>>>>>>>>>> I have an issue with setting the seed value when using the parallel
>>>>>>>>>> package.
>>>>>>>>>>
>>>>>>>>>> > library("parallel")
>>>>>>>>>> > library("digest")
>>>>>>>>>> >
>>>>>>>>>> > set.seed(0)
>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>> [1] "4827c80c"
>>>>>>>>>> >
>>>>>>>>>> > set.seed(0)
>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>> [1] "e95b9134"
>>>>>>>>>>
>>>>>>>>>> By default, set.seed() will be ignored since mclapply will set the
>>>>>>>>>> seed internally.
>>>>>>>>>>
>>>>>>>>>> If we use mc.set.seed=FALSE to disable this feature, it works as
>>>>>>>>>> indicated below:
>>>>>>>>>>
>>>>>>>>>> > set.seed(0)
>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>> >
>>>>>>>>>> > set.seed(0)
>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>
>>>>>>>>>> The problem is that the results also depend on the number of
>>>>>>>>>> cores.
>>>>>>>>>>
>>>>>>>>>> > set.seed(0)
>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>> +               mc.cores=4, mc.set.seed = FALSE)
>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>> [1] "a22e0aab"
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Any idea?
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Guangchuang
>>>>>>>>>> --
>>>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>>>>> School of Public Health
>>>>>>>>>> The University of Hong Kong
>>>>>>>>>> Hong Kong SAR, China
>>>>>>>>>> www: http://ygc.name
>>>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>>>>
>>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
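
For reference, the suggestion quoted above (generate the permutation
indexes outside of mclapply, then mclapply over a list of the indices)
might look roughly like the following.  This is only a minimal sketch,
assuming a Unix-like system where mclapply can fork.  The names n_perm,
perms and perm_stat are made up for illustration, and perm_stat stands in
for whatever statistic is actually computed per permutation.

```r
library(parallel)

# Hypothetical setup: 10 permutations of 1:10, as in the original post.
n_perm <- 10

# Draw all permutations up front, in the master process, under a seed that
# the user sets at the top level (not inside a package function).
set.seed(0)
perms <- lapply(seq_len(n_perm), function(i) sample(1:10))

# Placeholder for the real per-permutation computation (an assumption).
perm_stat <- function(idx) sum(idx * seq_along(idx))

# The workers only consume precomputed indices and draw no random numbers,
# so the number of cores cannot influence the result.
res2 <- mclapply(perms, perm_stat, mc.cores = 2)
res4 <- mclapply(perms, perm_stat, mc.cores = 4)

identical(res2, res4)
```

Because no random numbers are drawn in the workers, the result depends
only on the user's own set.seed call.  The cost is that all permutations
are held in memory at once.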

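The RngStream help page linked in the thread describes nextRNGStream()
for the "L'Ecuyer-CMRG" generator.  One way to make the output independent
of mc.cores, sketched here under the assumption of a Unix-like system with
forking, is to create one stream per task (rather than per core) and to
install that stream at the start of each task.  The names streams and
run_task are illustrative only, and this is not meant as a description of
what BiocParallel does.

```r
library(parallel)

# Generator and seed are chosen by the user at the top level.
RNGkind("L'Ecuyer-CMRG")
set.seed(0)

# Derive one stream per task from the current seed.
n_tasks <- 10
streams <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  streams[[i]] <- s
  s <- nextRNGStream(s)
}

# Each task installs its own stream in the worker before drawing numbers,
# so its output does not depend on which worker runs it or in what order.
run_task <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  sample(1:10)
}

m2 <- mclapply(streams, run_task, mc.cores = 2, mc.set.seed = FALSE)
m4 <- mclapply(streams, run_task, mc.cores = 4, mc.set.seed = FALSE)

identical(m2, m4)
```

The streams come from a generator designed for parallel use, which avoids
the concern raised in the thread about seeding each iteration with
unrelated integer seeds.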

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158


