[Bioc-devel] BiocParallel

Hahne, Florian florian.hahne at novartis.com
Fri Nov 16 09:18:47 CET 2012


I've hacked up some code that uses BatchJobs but makes it look like a
normal parLapply operation. Currently the main R process checks the state
of the queue at regular intervals and fetches the results once a job has
finished. It seems to work quite nicely, although there certainly are more
elaborate ways to deal with the synchronous/asynchronous issue. Would that
be of interest to a broader audience? I could add the code to BiocParallel
for folks to try out.
The whole thing may be a dumb idea, but I find it kind of useful to be
able to start parallel jobs directly from R on our huge SGE cluster, have
the calling script wait for all jobs to finish and then continue with some
downstream computations, rather than having to manually check the job
status and start another script once the results are there.
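
To make the polling idea concrete, here is a minimal sketch. It is not the
code described above and it does not call BatchJobs at all: the queue is a
toy in-memory stand-in, and make_toy_queue, submit_job, job_done,
job_result and queue_lapply are hypothetical names invented purely for
illustration. The only point is the shape of the loop: submit everything,
check job status at a regular interval, and fetch each result as soon as
its job reports completion.

```r
# Toy stand-in for a batch queue (an assumption, not a BatchJobs API).
# Each "job" reports itself finished only after a fixed number of
# status checks, which mimics a job that takes a while on a cluster.
make_toy_queue <- function() {
  jobs <- list()
  list(
    submit_job = function(fun, x, polls_needed = 2L) {
      id <- length(jobs) + 1L
      jobs[[id]] <<- list(fun = fun, x = x, polls_left = polls_needed)
      id
    },
    job_done = function(id) {
      jobs[[id]]$polls_left <<- jobs[[id]]$polls_left - 1L
      jobs[[id]]$polls_left <= 0L
    },
    job_result = function(id) jobs[[id]]$fun(jobs[[id]]$x)
  )
}

# lapply-like front end: submit one job per element, then poll at a
# regular interval and fetch results from jobs that have finished.
queue_lapply <- function(X, FUN, queue, interval = 0.01) {
  ids <- vapply(seq_along(X),
                function(i) queue$submit_job(FUN, X[[i]]),
                integer(1))
  results <- vector("list", length(X))
  pending <- rep(TRUE, length(X))
  while (any(pending)) {
    for (i in which(pending)) {
      if (queue$job_done(ids[i])) {
        results[i] <- list(queue$job_result(ids[i]))
        pending[i] <- FALSE
      }
    }
    # Sleep only if something is still outstanding.
    if (any(pending)) Sys.sleep(interval)
  }
  names(results) <- names(X)
  results
}

res <- queue_lapply(1:4, function(x) x^2, make_toy_queue())
```

The caller blocks until every job has finished, so downstream code can
simply follow the call, just as it would after lapply or parLapply.
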
Florian

On 11/15/12 9:38 PM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:

>On Thu, Nov 15, 2012 at 11:00 AM, Martin Morgan <mtmorgan at fhcrc.org>
>wrote:
>
>> On 11/15/2012 10:53 AM, Henrik Bengtsson wrote:
>>
>>> Is there any write up/discussion/plans on the various types of
>>> parallel computations out there:
>>>
>>> (1) one machine / multi-core/multi-threaded
>>> (2) multiple machines / multiple processes
>>> (3) batch / queue processing (on large compute clusters with many
>>>     users)
>>> (4) ...
>>>
>>> Are we/you mainly focusing on (1) and (2)?
>>>
>>
>> open for discussion; 1 & 2 are a good starting point for current scope.
>> r-pbd.org is relevant for 3.
>>
>>
>We have all three of those configurations here, so I've been looking into
>ways to facilitate each of them. One interesting package is BatchJobs. It
>handles simple clusters via ssh, as well as large managed clusters via,
>e.g., LSF.
>
>
>
>> Not sure how to best facilitate this conversation / prioritization on
>> github? if possible we should move the conversation there.
>>
>> Martin
>>
>>
>>
>>> /Henrik
>>>
>>> On Thu, Nov 15, 2012 at 6:21 AM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>
>>>> I'll second Ryan's patch (at least in principle).  When I parallelize
>>>> across multiple cores, I have always found mc.preschedule to be an
>>>> important option to expose (that, and the number of cores, is all I
>>>> use routinely).
>>>>
>>>> Kasper
>>>>
>>>> On Wed, Nov 14, 2012 at 7:14 PM, Ryan C. Thompson
>>>> <rct at thompsonclan.org> wrote:
>>>>
>>>>> I just submitted a pull request. I'll add tests shortly if I can
>>>>> figure out how to write them.
>>>>>
>>>>>
>>>>> On Wed 14 Nov 2012 03:50:36 PM PST, Martin Morgan wrote:
>>>>>
>>>>>>
>>>>>> On 11/14/2012 03:43 PM, Ryan C. Thompson wrote:
>>>>>>
>>>>>>>
>>>>>>> Here are two alternative implementations of pvec. pvec2 is just a
>>>>>>> simple rewrite of pvec to use mclapply. pvec3 then extends pvec2 to
>>>>>>> accept a specified chunk size or a specified number of chunks. If
>>>>>>> the number of chunks exceeds the number of cores, then multiple
>>>>>>> chunks will get run sequentially on each core. pvec3 also exposes
>>>>>>> the "mc.preschedule" argument of mclapply, since that is relevant
>>>>>>> when there are more chunks than cores. Lastly, I provide a
>>>>>>> "pvectorize" function which can be called on a regular vectorized
>>>>>>> function to make it into a pvec'd version of itself. Usage is like:
>>>>>>> sqrt.parallel <- pvectorize(sqrt); sqrt.parallel(1:1000).
>>>>>>>
>>>>>>> library(parallel)  # mclapply, splitIndices, mc.reset.stream
>>>>>>>
>>>>>>> pvec2 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>                   mc.cores = getOption("mc.cores", 2L),
>>>>>>>                   mc.cleanup = TRUE)
>>>>>>> {
>>>>>>>     env <- parent.frame()
>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>>>>
>>>>>>>     if(mc.set.seed) mc.reset.stream()
>>>>>>>
>>>>>>>     n <- length(v)
>>>>>>>     # One contiguous block of indices per core.
>>>>>>>     si <- splitIndices(n, cores)
>>>>>>>     res <- do.call(c,
>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>                             mc.set.seed=mc.set.seed,
>>>>>>>                             mc.silent=mc.silent,
>>>>>>>                             mc.cores=mc.cores,
>>>>>>>                             mc.cleanup=mc.cleanup))
>>>>>>>     if (length(res) != n)
>>>>>>>       warning("some results may be missing, folded or caused an error")
>>>>>>>     res
>>>>>>> }
>>>>>>> pvec3 <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
>>>>>>>                   mc.cores = getOption("mc.cores", 2L),
>>>>>>>                   mc.cleanup = TRUE, mc.preschedule = FALSE,
>>>>>>>                   num.chunks, chunk.size)
>>>>>>> {
>>>>>>>     env <- parent.frame()
>>>>>>>     cores <- as.integer(mc.cores)
>>>>>>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>>>>>     if(cores == 1L) return(FUN(v, ...))
>>>>>>>
>>>>>>>     if(mc.set.seed) mc.reset.stream()
>>>>>>>
>>>>>>>     n <- length(v)
>>>>>>>     # Default to one chunk per core unless num.chunks or chunk.size
>>>>>>>     # was supplied; num.chunks takes precedence over chunk.size.
>>>>>>>     if (missing(num.chunks)) {
>>>>>>>       if (missing(chunk.size)) {
>>>>>>>         num.chunks <- cores
>>>>>>>       } else {
>>>>>>>         num.chunks <- ceiling(n/chunk.size)
>>>>>>>       }
>>>>>>>     }
>>>>>>>     si <- splitIndices(n, num.chunks)
>>>>>>>     res <- do.call(c,
>>>>>>>                    mclapply(si, function(i) FUN(v[i], ...),
>>>>>>>                             mc.set.seed=mc.set.seed,
>>>>>>>                             mc.silent=mc.silent,
>>>>>>>                             mc.cores=mc.cores,
>>>>>>>                             mc.cleanup=mc.cleanup,
>>>>>>>                             mc.preschedule=mc.preschedule))
>>>>>>>     if (length(res) != n)
>>>>>>>       warning("some results may be missing, folded or caused an error")
>>>>>>>     res
>>>>>>> }
>>>>>>>
>>>>>>> pvectorize <- function(FUN) {
>>>>>>>     function(...) pvec3(FUN=FUN, ...)
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>>
>>>>>> would be great to have these as 'pull' requests in github; pvec3 as a
>>>>>> replacement for pvec, if it's implementing the same concept but better.
>>>>>>
>>>>>> Unit tests would be good (yes being a little hypocritical).
>>>>>> inst/unitTests, using RUnit, examples in
>>>>>>
>>>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges/inst/unitTests
>>>>>>
>>>>>>
>>>>>> with username / password readonly
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>  On Wed 14 Nov 2012 02:23:30 PM PST, Michael Lawrence wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 14, 2012 at 12:23 PM, Martin Morgan
>>>>>>>> <mtmorgan at fhcrc.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Interested developers -- I added the start of a BiocParallel
>>>>>>>>> package to the Bioconductor subversion repository and build system.
>>>>>>>>>
>>>>>>>>> The package is mirrored on github to allow for social coding; I
>>>>>>>>> encourage people to contribute via that mechanism.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/Bioconductor/BiocParallel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The purpose is to help focus our efforts at developing appropriate
>>>>>>>>> parallel paradigms. Currently the package Imports: parallel and
>>>>>>>>> implements pvec and mclapply in a way that allows for operation on
>>>>>>>>> any vector or list supporting length(), [, and [[ (the latter for
>>>>>>>>> mclapply). pvec in particular seems to be appropriate for
>>>>>>>>> GRanges-like objects, where we don't necessarily want to extract
>>>>>>>>> many thousands of S4 instances of individual ranges with [[.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Makes sense. Besides, [[ does not even work on GRanges. One
>>>>>>>> limitation of pvec is that it does not support a chunk size; it just
>>>>>>>> uses length(x) / ncores. It would be nice to be able to restrict
>>>>>>>> that, which would then require multiple jobs per core. Unless I'm
>>>>>>>> missing something.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hopefully the ideas in the package can be folded back in to
>>>>>>>>> parallel as they mature.
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>> --
>>>>>>>>> Dr. Martin Morgan, PhD
>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Dr. Martin Morgan, PhD
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>


