[Bioc-devel] reproducible with mclapply?

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Thu Jun 4 19:31:02 CEST 2015


You're ignoring the fact that some random number generators should never be
used inside of mclapply(), period.  You should add that to your post and
you should show how to set the random number generator appropriately.

You seem to be only focusing on reproducibility of the code and not
correctness.

Kasper

On Thu, Jun 4, 2015 at 1:20 PM, Vladislav Petyuk <petyuk at gmail.com> wrote:

> The only bad thing I see so far in using set.seed inside the function is
> that it interferes with previously set seed by the user.  So follow-up
> stochastic computation will be out user's control.  Perhaps there are other
> undesirable effect that I do not see at this point.
> I tweaked the solution a bit here that wraps mclapply/lapply and maintains
> the user control of stochasticity by resetting the seed to some random
> value generated based on users input.
>
> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
> I tend to agree though that in a long run doRNG is the way to go.
>
> On Thu, Jun 4, 2015 at 8:15 AM, Valerie Obenchain <vobencha at fredhutch.org>
> wrote:
>
>> I'll add a section to the BiocParallel docs.
>>
>> Valerie
>>
>> On 06/04/2015 07:55 AM, Kasper Daniel Hansen wrote:
>>
>>> Yes, based on the documentation that particular random stream generator
>>> would work with mclapply.
>>>
>>> This is absolutely a subject which ought to be covered in the
>>> BiocParallel
>>> documentation.
>>>
>>> And commenting on another set of recommendations: please NEVER used
>>> set.seed inside a function.  Unfortunately, because of the way R works,
>>> this is a really bad idea.  As is functions with arguments like
>>> (set.seed =
>>> FALSE).  Users need to be educated about this.  The main issue with using
>>> set.seed is when your work is wrapped into other peoples code, for
>>> example
>>> with an external bootstrap or similar.  I understand the desire for
>>> reproducibility, but the design of the random generator in R is such that
>>> this should really be left to the user.
>>>
>>> Kasper
>>>
>>> On Thu, Jun 4, 2015 at 10:39 AM, Vincent Carey <
>>> stvjc at channing.harvard.edu>
>>>
>>> wrote:
>>>
>>>  It does appear to me that the doRNG vignette sec 1.1 describes a
>>>> solution
>>>> to the problem posed.  It is less clear to me that this method is
>>>> readily
>>>> adopted with BiocParallel unless registerDoPar is in use....  Should we
>>>> address this topic explicitly in the vignette?
>>>>
>>>> On Thu, Jun 4, 2015 at 9:50 AM, Kasper Daniel Hansen <
>>>> kasperdanielhansen at gmail.com> wrote:
>>>>
>>>>  Note you're not guaranteed that two random streams starting with
>>>>> different
>>>>> seeds will be (approximately) independent, so the suggestion on SO
>>>>> makes
>>>>> the numbers reproducible but technically wrong.
>>>>>
>>>>> If you want true independence you either need to use a parallel
>>>>> version of
>>>>> the random number generator or you do what I suggested.  Because of how
>>>>> mclapply works (via fork) it is not clear to me that it is possible to
>>>>> use
>>>>> a parallel version of the random number generator, but I am not sure
>>>>> about
>>>>> this.  The snippet from the documentation quoted above suggests I am
>>>>> wrong.
>>>>>
>>>>> Best,
>>>>> Kasper
>>>>>
>>>>> On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk <petyuk at gmail.com>
>>>>> wrote:
>>>>>
>>>>>  There are different ways set.seed can be used.  The way it is
>>>>>> suggested
>>>>>>
>>>>> on
>>>>>
>>>>>> the aforementioned stackoverflow post is basically a two stage
>>>>>> process.
>>>>>> First seed is provided by a user (set.seed(1)).  That is user can
>>>>>> change
>>>>>> the outcome from run to run.  Based on that seed, a vector of
>>>>>> randomized
>>>>>> seeds is generated (seeds <- sample.int(length(input),
>>>>>> replace=TRUE)).
>>>>>> Those seeds are basically arguments to the function under
>>>>>>
>>>>> mclapply/lapply
>>>>>
>>>>>> that help to control random number generation for each iteration
>>>>>>
>>>>> (set.seed
>>>>>
>>>>>> (seeds[idx])).
>>>>>> There are two different roles of set.seed. First left the user to
>>>>>>
>>>>> control
>>>>>
>>>>>> random number generation and the second (within the function) makes
>>>>>> sure
>>>>>> that it is the same for individual iterations regardless how the loop
>>>>>> is
>>>>>> executed.
>>>>>> Does that make sense?
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>>>> wrote:
>>>>>>
>>>>>>  There is one possible solution posted in
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
>>>>>
>>>>>> .
>>>>>>>
>>>>>>> As Kasper suggested, it's not a proper way to use set.seed inside a
>>>>>>> package.
>>>>>>>
>>>>>>> I suggest using a parameter for example seed=FALSE to disable the
>>>>>>>
>>>>>> set.seed
>>>>>
>>>>>> and if user want the result reproducible, e.g. in demonstration, set
>>>>>>> seed=TRUE explicitly and set.seed will be run inside the function.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Guangchuang
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
>>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>>
>>>>>>>  For this situation, generate the permutation indexes outside of the
>>>>>>>> mclapply, and the do mclapply over a list with the indices.
>>>>>>>>
>>>>>>>> And btw., please don't use set.seed inside a package; that control
>>>>>>>>
>>>>>>> should
>>>>>>>
>>>>>>>> completely be left to the user.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey <
>>>>>>>>
>>>>>>> stvjc at channing.harvard.edu>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  This document indicates how to achieve reproducibility independent
>>>>>>>>>
>>>>>>>> of
>>>>>
>>>>>> the
>>>>>>>
>>>>>>>> underlying physical environment.
>>>>>>>>>
>>>>>>>>> http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>>>>>>>>>
>>>>>>>>> Let me know if that satisfies the question.
>>>>>>>>>
>>>>>>>>> On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang <
>>>>>>>>>
>>>>>>>> gcyu at connect.hku.hk>
>>>>>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  Der Vincent,
>>>>>>>>>>
>>>>>>>>>> RNGkind("L'Ecuyer-CMRG") works as using mc.set.seed=FALSE.
>>>>>>>>>>
>>>>>>>>>> When mc.cores changes, the output is not reproducible.
>>>>>>>>>>
>>>>>>>>>> I think this issue is also of concern within the Bioconductor
>>>>>>>>>>
>>>>>>>>> community
>>>>>>>
>>>>>>>> as parallel version of permutation test is commonly used now.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>>
>>>>>>>>>> Guangchuang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey <
>>>>>>>>>>
>>>>>>>>> stvjc at channing.harvard.edu>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>  Hi, this question belongs on R-help, but perhaps
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>> https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html
>>>>>
>>>>>>
>>>>>>>>>>> will be useful.
>>>>>>>>>>>
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang <
>>>>>>>>>>>
>>>>>>>>>> gcyu at connect.hku.hk>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>  Dear all,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an issue of setting seed value when using parallel
>>>>>>>>>>>>
>>>>>>>>>>> package.
>>>>>
>>>>>>
>>>>>>>>>>>>  library("parallel")
>>>>>>>>>>>>> library("digest")
>>>>>>>>>>>>>
>>>>>>>>>>>>> set.seed(0)
>>>>>>>>>>>>> m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>>>>
>>>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>>>>
>>>>>>>>>>>>> digest(m, 'crc32')
>>>>>>>>>>>>>
>>>>>>>>>>>> [1] "4827c80c"
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> set.seed(0)
>>>>>>>>>>>>> m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>>>>
>>>>>>>>>>>> +               mc.cores=2)
>>>>>>>>>>>>
>>>>>>>>>>>>> digest(m, 'crc32')
>>>>>>>>>>>>>
>>>>>>>>>>>> [1] "e95b9134"
>>>>>>>>>>>>
>>>>>>>>>>>> By default, set.seed() will be ignored since mclapply will set
>>>>>>>>>>>>
>>>>>>>>>>> the
>>>>>
>>>>>> seed
>>>>>>>>>
>>>>>>>>>> internally.
>>>>>>>>>>>>
>>>>>>>>>>>> If we use mc.set.seed=FALSE to disable this feature. It works as
>>>>>>>>>>>> indicated
>>>>>>>>>>>> below:
>>>>>>>>>>>>
>>>>>>>>>>>>  set.seed(0)
>>>>>>>>>>>>> m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>>>>
>>>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>>>
>>>>>>>>>>>>> digest(m, 'crc32')
>>>>>>>>>>>>>
>>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> set.seed(0)
>>>>>>>>>>>>> m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>>>>
>>>>>>>>>>>> +               mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>>>
>>>>>>>>>>>>> digest(m, 'crc32')
>>>>>>>>>>>>>
>>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>>>
>>>>>>>>>>>> The problems is that the results are also depending on the
>>>>>>>>>>>>
>>>>>>>>>>> number
>>>>>
>>>>>> of
>>>>>>>
>>>>>>>> cores.
>>>>>>>>>>>>
>>>>>>>>>>>>  set.seed(0)
>>>>>>>>>>>>> m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>>>>
>>>>>>>>>>>> +               mc.cores=4, mc.set.seed = FALSE)
>>>>>>>>>>>>
>>>>>>>>>>>>> digest(m, 'crc32')
>>>>>>>>>>>>>
>>>>>>>>>>>> [1] "a22e0aab"
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Any idea?
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Guangchuang
>>>>>>>>>>>> --
>>>>>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>>>>>>> School of Public Health
>>>>>>>>>>>> The University of Hong Kong
>>>>>>>>>>>> Hong Kong SAR, China
>>>>>>>>>>>> www: http://ygc.name
>>>>>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>>>>>>
>>>>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>>>>> School of Public Health
>>>>>>>>>> The University of Hong Kong
>>>>>>>>>> Hong Kong SAR, China
>>>>>>>>>> www: http://ygc.name
>>>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>> School of Public Health
>>>>>>> The University of Hong Kong
>>>>>>> Hong Kong SAR, China
>>>>>>> www: http://ygc.name
>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>
>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>
>>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, Seattle, WA 98109
>>
>> Email: vobencha at fredhutch.org
>> Phone: (206) 667-3158
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list