[Bioc-devel] BiocParallel

Thu Nov 15 20:00:23 CET 2012

On 11/15/2012 10:53 AM, Henrik Bengtsson wrote:
> Is there any write up/discussion/plans on the various types of
> parallel computations out there:
>
> (1) one machine / multi-core/multi-threaded
> (2) multiple machines / multiple processes
> (3) batch / queue processing (on large compute clusters with many users).
> (4) ...
>
> Are we/you mainly focusing on (1) and (2)?

open for discussion; 1 & 2 are a good starting point for current scope. 
r-pbd.org is relevant for 3.

Not sure how to best facilitate this conversation / prioritization on github? if 
possible we should move the conversation there.

Martin

>
> /Henrik
>
> On Thu, Nov 15, 2012 at 6:21 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>> I'll second Ryan's patch (at least in principle).  When I parallelize
>> across multiple cores, I have always found mc.preschedule to be an
>> important option to expose (that, and the number of cores, is all I
>> use routinely).
>>
>> Kasper
>>
>> On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
>>> I just submitted a pull request. I'll add tests shortly if I can figure out
>>> how to write them.
>>>
>>>
>>> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
>>>>
>>>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
>>>>>
>>>>> Here are two alternative implementations of pvec. pvec2 is just a
>>>>> simple rewrite
>>>>> of pvec to use mclapply. pvec3 then extends pvec2 to accept a
>>>>> specified chunk
>>>>> size or a specified number of chunks. If the number of chunks exceeds
>>>>> the number
>>>>> of cores, then multiple chunks will get run sequentially on each
>>>>> core. pvec3
>>>>> also exposes the "mc.prescheule" argument of mclapply, since that is
>>>>> relevant
>>>>> when there are more chunks than cores. Lastly, I provide a
>>>>> "pvectorize" function
>>>>> which can be called on a regular vectorized function to make it into
>>>>> a pvec'd
>>>>> version of itself. Usage is like: sqrt.parallel <- pvectorize(sqrt);
>>>>> sqrt.parallel(1:1000).
>>>>>
>>>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>                     mc.cores = getOption("mc.cores", 2L), mc.cleanup =
>>>>> TRUE)
>>>>> {
>>>>>     env <- parent.frame()
>>>>>     cores <- as.integer(mc.cores)
>>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>>
>>>>>     if(mc.set.seed) mc.reset.stream()
>>>>>
>>>>>     n <- length(v)
>>>>>     si <- splitIndices(n, cores)
>>>>>     res <- do.call(c,
>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>                             mc.set.seed=mc.set.seed,
>>>>>                             mc.silent=mc.silent,
>>>>>                             mc.cores=mc.cores,
>>>>>                             mc.cleanup=mc.cleanup))
>>>>>     if (length(res) != n)
>>>>>       warning("some results may be missing, folded or caused an error")
>>>>>     res
>>>>> }
>>>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>                     mc.cores = getOption("mc.cores", 2L), mc.cleanup =
>>>>> TRUE,
>>>>>                     mc.preschedule=FALSE, num.chunks, chunk.size)
>>>>> {
>>>>>     env <- parent.frame()
>>>>>     cores <- as.integer(mc.cores)
>>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>>
>>>>>     if(mc.set.seed) mc.reset.stream()
>>>>>
>>>>>     n <- length(v)
>>>>>     if (missing(num.chunks)) {
>>>>>       if (missing(chunk.size)) {
>>>>>         num.chunks <- cores
>>>>>       } else {
>>>>>         num.chunks <- ceiling(n/chunk.size)
>>>>>       }
>>>>>     }
>>>>>     si <- splitIndices(n, num.chunks)
>>>>>     res <- do.call(c,
>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>                             mc.set.seed=mc.set.seed,
>>>>>                             mc.silent=mc.silent,
>>>>>                             mc.cores=mc.cores,
>>>>>                             mc.cleanup=mc.cleanup,
>>>>>                             mc.preschedule=mc.preschedule))
>>>>>     if (length(res) != n)
>>>>>       warning("some results may be missing, folded or caused an error")
>>>>>     res
>>>>> }
>>>>>
>>>>> pvectorize <- function(FUN) {
>>>>>     function(...) pvec3(FUN=FUN, ...)
>>>>> }
>>>>
>>>>
>>>> would be great to have these as 'pull' requests in github; pvec3 as a
>>>> replacement for pvec, if it's implementing the same concept but better.
>>>>
>>>> Unit tests would be good (yes being a little hypocritical).
>>>> inst/unitTests, using RUnit, examples in
>>>>
>>>>
>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
>>>>
>>>>
>>>> with username / password readonly
>>>>
>>>> Martin
>>>>
>>>>> On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmorgan at fhcrc.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Interested developers -- I added the start of a BiocParallel
>>>>>>> package to
>>>>>>> the Bioconductor subversion repository and build system.
>>>>>>>
>>>>>>> The package is mirrored on github to allow for social coding; I
>>>>>>> encourage
>>>>>>> people to contribute via that mechanism.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/**Bioconductor/BiocParallel<https://github.com/Bioconductor/BiocParallel>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The purpose is to help focus our efforts at developing appropriate
>>>>>>> parallel paradigms. Currently the package Imports: parallel and
>>>>>>> implements
>>>>>>> pvec and mclapply in a way that allows for operation on any vector
>>>>>>> or list
>>>>>>> supporting length(), [, and [[ (the latter for mclapply). pvec in
>>>>>>> particular seems to be appropriate for GRanges-like objects, where
>>>>>>> we don't
>>>>>>> necessarily want to extract many thousands of S4 instances of
>>>>>>> individual
>>>>>>> ranges with [[.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Makes sense. Besides, [[ does not even work on GRanges. One
>>>>>> limitation of
>>>>>> pvec is that it does not support a chunk size; it just uses length(x) /
>>>>>> ncores. It would be nice to be able to restrict that, which would then
>>>>>> require multiple jobs per core. Unless I'm missing something.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hopefully the ideas in the package can be folded back in to
>>>>>>> parallel as
>>>>>>> they mature.
>>>>>>>
>>>>>>> Martin
>>>>>>> --
>>>>>>> Dr. Martin Morgan, PhD
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> ______________________________**_________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>
>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109