[Bioc-devel] BiocParallel

Martin Morgan mtmorgan at fhcrc.org
Thu Nov 15 19:58:21 CET 2012


On 11/15/2012 6:21 AM, Kasper Daniel Hansen wrote:
> I'll second Ryan's patch (at least in principle).  When I parallelize
> across multiple cores, I have always found mc.preschedule to be an
> important option to expose (that, and the number of cores, is all I
> use routinely).

Yes, Ryan provided a pull request and I've incorporated it. His pvectorize is 
also quite fun...

   options(mc.cores=8)
   pcountOverlaps <- pvectorize(countOverlaps)

and for a GappedAlignments object with about 10 million ranges

 > system.time(yy <- countOverlaps(gal, exByTx, ignore.strand=TRUE))
    user  system elapsed
  79.997   6.941  87.208
 > system.time(xx <- pcountOverlaps(gal, exByTx, ignore.strand=TRUE))
    user  system elapsed
717.273  34.443  18.622

expect a fluid interface, though...

git / svn gurus: Ryan gets attribution in the git repo, but using git svn 
dcommit I get all the credit. Tricks?

Martin

>
> Kasper
>
> On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
>> I just submitted a pull request. I'll add tests shortly if I can figure out
>> how to write them.
>>
>>
>> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
>>>
>>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
>>>>
>>>> Here are two alternative implementations of pvec. pvec2 is just a
>>>> simple rewrite
>>>> of pvec to use mclapply. pvec3 then extends pvec2 to accept a
>>>> specified chunk
>>>> size or a specified number of chunks. If the number of chunks exceeds
>>>> the number
>>>> of cores, then multiple chunks will get run sequentially on each
>>>> core. pvec3
>>>> also exposes the "mc.prescheule" argument of mclapply, since that is
>>>> relevant
>>>> when there are more chunks than cores. Lastly, I provide a
>>>> "pvectorize" function
>>>> which can be called on a regular vectorized function to make it into
>>>> a pvec'd
>>>> version of itself. Usage is like: sqrt.parallel <- pvectorize(sqrt);
>>>> sqrt.parallel(1:1000).
>>>>
>>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>                     mc.cores = getOption("mc.cores", 2L), mc.cleanup =
>>>> TRUE)
>>>> {
>>>>     env <- parent.frame()
>>>>     cores <- as.integer(mc.cores)
>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>
>>>>     if(mc.set.seed) mc.reset.stream()
>>>>
>>>>     n <- length(v)
>>>>     si <- splitIndices(n, cores)
>>>>     res <- do.call(c,
>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>                             mc.set.seed=mc.set.seed,
>>>>                             mc.silent=mc.silent,
>>>>                             mc.cores=mc.cores,
>>>>                             mc.cleanup=mc.cleanup))
>>>>     if (length(res) != n)
>>>>       warning("some results may be missing, folded or caused an error")
>>>>     res
>>>> }
>>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>                     mc.cores = getOption("mc.cores", 2L), mc.cleanup =
>>>> TRUE,
>>>>                     mc.preschedule=FALSE, num.chunks, chunk.size)
>>>> {
>>>>     env <- parent.frame()
>>>>     cores <- as.integer(mc.cores)
>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>
>>>>     if(mc.set.seed) mc.reset.stream()
>>>>
>>>>     n <- length(v)
>>>>     if (missing(num.chunks)) {
>>>>       if (missing(chunk.size)) {
>>>>         num.chunks <- cores
>>>>       } else {
>>>>         num.chunks <- ceiling(n/chunk.size)
>>>>       }
>>>>     }
>>>>     si <- splitIndices(n, num.chunks)
>>>>     res <- do.call(c,
>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>                             mc.set.seed=mc.set.seed,
>>>>                             mc.silent=mc.silent,
>>>>                             mc.cores=mc.cores,
>>>>                             mc.cleanup=mc.cleanup,
>>>>                             mc.preschedule=mc.preschedule))
>>>>     if (length(res) != n)
>>>>       warning("some results may be missing, folded or caused an error")
>>>>     res
>>>> }
>>>>
>>>> pvectorize <- function(FUN) {
>>>>     function(...) pvec3(FUN=FUN, ...)
>>>> }
>>>
>>>
>>> would be great to have these as 'pull' requests in github; pvec3 as a
>>> replacement for pvec, if it's implementing the same concept but better.
>>>
>>> Unit tests would be good (yes being a little hypocritical).
>>> inst/unitTests, using RUnit, examples in
>>>
>>>
>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
>>>
>>>
>>> with username / password readonly
>>>
>>> Martin
>>>
>>>> On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
>>>>>
>>>>>
>>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan <mtmorgan at fhcrc.org>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Interested developers -- I added the start of a BiocParallel
>>>>>> package to
>>>>>> the Bioconductor subversion repository and build system.
>>>>>>
>>>>>> The package is mirrored on github to allow for social coding; I
>>>>>> encourage
>>>>>> people to contribute via that mechanism.
>>>>>>
>>>>>>
>>>>>> https://github.com/**Bioconductor/BiocParallel<https://github.com/Bioconductor/BiocParallel>
>>>>>>
>>>>>>
>>>>>>
>>>>>> The purpose is to help focus our efforts at developing appropriate
>>>>>> parallel paradigms. Currently the package Imports: parallel and
>>>>>> implements
>>>>>> pvec and mclapply in a way that allows for operation on any vector
>>>>>> or list
>>>>>> supporting length(), [, and [[ (the latter for mclapply). pvec in
>>>>>> particular seems to be appropriate for GRanges-like objects, where
>>>>>> we don't
>>>>>> necessarily want to extract many thousands of S4 instances of
>>>>>> individual
>>>>>> ranges with [[.
>>>>>>
>>>>>
>>>>>
>>>>> Makes sense. Besides, [[ does not even work on GRanges. One
>>>>> limitation of
>>>>> pvec is that it does not support a chunk size; it just uses length(x) /
>>>>> ncores. It would be nice to be able to restrict that, which would then
>>>>> require multiple jobs per core. Unless I'm missing something.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Hopefully the ideas in the package can be folded back in to
>>>>>> parallel as
>>>>>> they mature.
>>>>>>
>>>>>> Martin
>>>>>> --
>>>>>> Dr. Martin Morgan, PhD
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109



More information about the Bioc-devel mailing list