[Bioc-devel] reproducible with mclapply?
Vladislav Petyuk
petyuk at gmail.com
Thu Jun 4 19:20:13 CEST 2015
The only drawback I see so far in using set.seed inside the function is
that it interferes with the seed previously set by the user, so any follow-up
stochastic computation will be out of the user's control. Perhaps there are
other undesirable effects that I do not see at this point.
I tweaked the solution a bit; the version posted here wraps mclapply/lapply
and maintains the user's control of stochasticity by resetting the seed to a
random value generated based on the user's input.
http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
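A minimal sketch of that wrapper idea follows. It is not the code from the linked answer: the name seeded_lapply and its arguments are made up for illustration, and the mc.cores > 1 branch assumes a Unix-alike where mclapply can fork. Seeding each iteration separately makes the results reproducible but, as Kasper points out in the thread, does not guarantee that the per-iteration streams are independent.

```r
library(parallel)

# Hypothetical helper (illustration only): draws one seed per element from
# the user's RNG stream, seeds each iteration with it, and finally resets
# the caller's RNG from one extra seed drawn from the same stream.
seeded_lapply <- function(X, FUN, ..., mc.cores = 1L) {
  n <- length(X)
  # n seeds for the iterations plus one to restore user control afterwards.
  seeds <- sample.int(.Machine$integer.max, n + 1L)
  worker <- function(i) {
    set.seed(seeds[i])   # same seed for element i however the loop is run
    FUN(X[[i]], ...)
  }
  out <- if (mc.cores > 1L) {
    mclapply(seq_len(n), worker, mc.cores = mc.cores)
  } else {
    lapply(seq_len(n), worker)
  }
  # The caller's RNG state now depends only on the seed the user set before
  # the call, not on whether iterations ran here or in forked children.
  set.seed(seeds[n + 1L])
  out
}

set.seed(1)
a <- seeded_lapply(1:10, function(x) sample(1:10), mc.cores = 1L)
set.seed(1)
b <- seeded_lapply(1:10, function(x) sample(1:10), mc.cores = 2L)
identical(a, b)
```

With the same user seed, the serial and two-core runs should agree, and so should whatever random numbers the user draws afterwards.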
I tend to agree, though, that in the long run doRNG is the way to go.
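The idea of giving each iteration, rather than each core, its own L'Ecuyer-CMRG stream can also be sketched with the parallel package alone, along the lines of the RngStream help page linked earlier in the thread. This is only an illustration under assumptions: permute_with_stream is a hypothetical helper, the code is not how doRNG itself is implemented, and mclapply with more than one core needs a Unix-alike.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # the user chooses the generator ...
set.seed(0)               # ... and the seed, outside of any function

# Build one stream per task (not per core); each stream is derived from the
# previous one with nextRNGStream().
n_tasks <- 10L
streams <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  streams[[i]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical task function: installs its own stream as the RNG state of
# the current process, then does the random work.
permute_with_stream <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  sample(1:10)
}

m1 <- lapply(streams, permute_with_stream)
m2 <- mclapply(streams, permute_with_stream, mc.cores = 2L)
m4 <- mclapply(streams, permute_with_stream, mc.cores = 4L)
identical(m2, m4)
```

Because each task installs its own stream before sampling, the result should not depend on mc.cores.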
On Thu, Jun 4, 2015 at 8:15 AM, Valerie Obenchain <vobencha at fredhutch.org>
wrote:
> I'll add a section to the BiocParallel docs.
>
> Valerie
>
> On 06/04/2015 07:55 AM, Kasper Daniel Hansen wrote:
>
>> Yes, based on the documentation that particular random stream generator
>> would work with mclapply.
>>
>> This is absolutely a subject which ought to be covered in the BiocParallel
>> documentation.
>>
>> And commenting on another set of recommendations: please NEVER use
>> set.seed inside a function. Unfortunately, because of the way R works,
>> this is a really bad idea. So are functions with arguments like
>> (set.seed = FALSE). Users need to be educated about this. The main issue
>> with using set.seed is when your work is wrapped into other people's code,
>> for example with an external bootstrap or similar. I understand the desire
>> for reproducibility, but the design of the random generator in R is such
>> that this should really be left to the user.
>>
>> Kasper
>>
>> On Thu, Jun 4, 2015 at 10:39 AM, Vincent Carey <stvjc at channing.harvard.edu>
>> wrote:
>>
>>> It does appear to me that the doRNG vignette sec 1.1 describes a solution
>>> to the problem posed. It is less clear to me that this method is readily
>>> adopted with BiocParallel unless registerDoPar is in use.... Should we
>>> address this topic explicitly in the vignette?
>>>
>>> On Thu, Jun 4, 2015 at 9:50 AM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> Note you're not guaranteed that two random streams starting with different
>>>> seeds will be (approximately) independent, so the suggestion on SO makes
>>>> the numbers reproducible but technically wrong.
>>>>
>>>> If you want true independence you either need to use a parallel version of
>>>> the random number generator or you do what I suggested. Because of how
>>>> mclapply works (via fork) it is not clear to me that it is possible to use
>>>> a parallel version of the random number generator, but I am not sure about
>>>> this. The snippet from the documentation quoted above suggests I am wrong.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>> On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk <petyuk at gmail.com>
>>>> wrote:
>>>>
>>>>> There are different ways set.seed can be used. The way it is suggested in
>>>>> the aforementioned stackoverflow post is basically a two-stage process.
>>>>> First, a seed is provided by the user (set.seed(1)); that is, the user can
>>>>> change the outcome from run to run. Based on that seed, a vector of
>>>>> randomized seeds is generated (seeds <- sample.int(length(input),
>>>>> replace=TRUE)). Those seeds are basically arguments to the function under
>>>>> mclapply/lapply that help to control random number generation for each
>>>>> iteration (set.seed(seeds[idx])).
>>>>> So set.seed plays two different roles. The first lets the user control
>>>>> random number generation, and the second (within the function) makes sure
>>>>> that it is the same for individual iterations regardless of how the loop
>>>>> is executed.
>>>>> Does that make sense?
>>>>>
>>>>> On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> There is one possible solution posted at
>>>>>> http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984
>>>>>>
>>>>>> As Kasper suggested, it's not proper to use set.seed inside a package.
>>>>>>
>>>>>> I suggest using a parameter, for example seed=FALSE, to disable the
>>>>>> set.seed; if the user wants the result to be reproducible, e.g. in a
>>>>>> demonstration, they set seed=TRUE explicitly and set.seed will be run
>>>>>> inside the function.
>>>>>>
>>>>>> Bests,
>>>>>> Guangchuang
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>
>>>>>>> For this situation, generate the permutation indexes outside of the
>>>>>>> mclapply, and then do mclapply over a list with the indices.
>>>>>>>
>>>>>>> And btw., please don't use set.seed inside a package; that control should
>>>>>>> be left completely to the user.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kasper
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This document indicates how to achieve reproducibility independent of the
>>>>>>>> underlying physical environment.
>>>>>>>>
>>>>>>>> http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>>>>>>>>
>>>>>>>> Let me know if that satisfies the question.
>>>>>>>>
>>>>>>>> On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear Vincent,
>>>>>>>>>
>>>>>>>>> RNGkind("L'Ecuyer-CMRG") works the same as using mc.set.seed=FALSE.
>>>>>>>>>
>>>>>>>>> When mc.cores changes, the output is not reproducible.
>>>>>>>>>
>>>>>>>>> I think this issue is also of concern within the Bioconductor community,
>>>>>>>>> as parallel versions of permutation tests are commonly used now.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> Guangchuang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, this question belongs on R-help, but perhaps
>>>>>>>>>>
>>>>>>>>>> https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html
>>>>>>>>>>
>>>>>>>>>> will be useful.
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang <gcyu at connect.hku.hk>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> I have an issue with setting the seed value when using the parallel
>>>>>>>>>>> package.
>>>>>>>>>>>
>>>>>>>>>>> > library("parallel")
>>>>>>>>>>> > library("digest")
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> + mc.cores=2)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "4827c80c"
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> + mc.cores=2)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "e95b9134"
>>>>>>>>>>>
>>>>>>>>>>> By default, set.seed() will be ignored since mclapply will set the seed
>>>>>>>>>>> internally.
>>>>>>>>>>>
>>>>>>>>>>> If we use mc.set.seed=FALSE to disable this feature, it works as
>>>>>>>>>>> indicated below:
>>>>>>>>>>>
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> + mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>> >
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> + mc.cores=2, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "6bbada78"
>>>>>>>>>>>
>>>>>>>>>>> The problem is that the results also depend on the number of cores.
>>>>>>>>>>>
>>>>>>>>>>> > set.seed(0)
>>>>>>>>>>> > m <- mclapply(1:10, function(x) sample(1:10),
>>>>>>>>>>> + mc.cores=4, mc.set.seed = FALSE)
>>>>>>>>>>> > digest(m, 'crc32')
>>>>>>>>>>> [1] "a22e0aab"
>>>>>>>>>>>
>>>>>>>>>>> Any idea?
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Guangchuang
>>>>>>>>>>> --
>>>>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>>>>> Guangchuang Yu, PhD Candidate
>>>>>>>>>>> State Key Laboratory of Emerging Infectious Diseases
>>>>>>>>>>> School of Public Health
>>>>>>>>>>> The University of Hong Kong
>>>>>>>>>>> Hong Kong SAR, China
>>>>>>>>>>> www: http://ygc.name
>>>>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>>>>>
>>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, Seattle, WA 98109
>
> Email: vobencha at fredhutch.org
> Phone: (206) 667-3158
>
>
>