[Bioc-devel] BiocParallel -- update

Martin Morgan mtmorgan at fhcrc.org
Tue Dec 4 02:32:52 CET 2012


Bioc Developers --

BiocParallel generated quite a bit of discussion, so I'm providing a brief 
update. Version 0.0.5 is available to R-devel users via biocLite; it's in svn

   https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/BiocParallel

and github

   https://github.com/Bioconductor/BiocParallel

We have tried to incorporate some key ideas, though things are far from complete.


The basic idea is that one creates a 'param'

   p = MulticoreParam(workers=8)

and uses that in computations

   bplapply(1:8, function(i) Sys.sleep(1), param=p)


There is a simple registry, populated at start-up with a 'greedy' param instance 
(e.g., MulticoreParam(workers=parallel::detectCores())) or invoked explicitly

   register(p)

The 'default' (most recently registered, with the default=TRUE argument) is used 
if param is missing

   bplapply(1:8, function(i) Sys.sleep(1))
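
To make the registry idea concrete, here is a minimal sketch in base R. It is not 
BiocParallel's implementation: toy_register, toy_default, toy_lapply and the 
plain-list 'param' are hypothetical names invented for illustration, and 
parallel::mclapply stands in for the real back-end. It assumes only that a 
registered default is consulted when param is missing.

```r
## Toy registry; NOT BiocParallel code. All toy_* names are hypothetical.
.toy_registry <- new.env(parent = emptyenv())
.toy_registry$params <- list()

toy_register <- function(param, default = TRUE) {
    params <- .toy_registry$params
    ## a new default goes to the front, anything else to the back
    .toy_registry$params <-
        if (default) c(list(param), params) else c(params, list(param))
    invisible(param)
}

toy_default <- function() {
    params <- .toy_registry$params
    if (length(params) == 0L)
        stop("no param registered")
    params[[1L]]
}

toy_lapply <- function(X, FUN, ..., param) {
    ## fall back to the registered default when 'param' is not supplied
    if (missing(param))
        param <- toy_default()
    parallel::mclapply(X, FUN, ..., mc.cores = param$workers)
}

## workers = 1L keeps the example portable (mclapply cannot fork on Windows)
toy_register(list(workers = 1L))
res <- toy_lapply(1:3, function(i) i^2)
```

Keeping the default at the front of the list means that looking it up never 
needs to know how many params have been registered.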


There are MulticoreParam, SnowParam, and DoparParam params so far; SnowParam is 
'lazy' and bpstart / bpstop can be used to start the implied cluster

 > p = SnowParam(workers=2)
 > p = bpstart(p)
Bioconductor version 2.12 (BiocInstaller 1.9.5), ?biocLite for help
Bioconductor version 2.12 (BiocInstaller 1.9.5), ?biocLite for help
 > p = bpstop(p)

DoparParam (currently) indicates that a foreach-style back-end has been 
registered (via standard foreach approaches), and bplapply(1:8, ..., 
param=DoparParam()) uses foreach for evaluation. *Param are S4 classes (should 
probably be reference classes) that extend BiocParallelParam, so anyone can 
implement a new *Param; eventually BiocParallelParam will define 'required' 
fields (like 'workers' and 'setSeed') that all *Param objects are expected to 
support.


bplapply has signature bplapply(X, FUN, ..., param) and is generic in all three 
named arguments (X, FUN, and param), so again package developers can implement versions tailored to 
their clusters (Florian has sent me some code for an SGE scheduler, which I have 
not yet incorporated).
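
As a rough illustration of how extending a base param class and dispatching on 
it could fit together, the sketch below uses only the methods package. ToyParam, 
ToySerialParam and toyLapply are hypothetical stand-ins, not the classes or 
generics in BiocParallel, and the serial lapply marks the place where a 
back-end specific method would do its work.

```r
library(methods)

## Hypothetical names throughout; this is not BiocParallel code.
## A virtual base class with a field every param is expected to have.
setClass("ToyParam", representation("VIRTUAL", workers = "numeric"))

## A new back-end is just a class that extends the base class.
setClass("ToySerialParam", contains = "ToyParam")

## A generic that can dispatch on X, FUN, and param ('...' is excluded).
setGeneric("toyLapply",
    function(X, FUN, ..., param) standardGeneric("toyLapply"),
    signature = c("X", "FUN", "param"))

## A method tailored to one back-end; here it simply evaluates serially.
setMethod("toyLapply", c("ANY", "ANY", "ToySerialParam"),
    function(X, FUN, ..., param) lapply(X, FUN, ...))

p <- new("ToySerialParam", workers = 1)
res <- toyLapply(1:3, function(i) i * 2, param = p)
```

A developer targeting a particular scheduler would, under this sketch, add one 
class and one method without touching the generic.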


Only bplapply and bpvec are currently implemented as 'algorithms'. They have a 
common signature and have been implemented to rely only on length, '[', '[[' 
(for bplapply) and 'c' (for bpvec); this is the 'contract' that we'll try to 
maintain. We'd like to implement other algorithms, and to make current 
algorithms more useful by including better error handling, scheduling, and 
reduction.
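
The 'contract' can be sketched as follows: a bpvec-like function that touches 
its input only through length, '[' and 'c', so it should work on any 
vector-like class providing those three operations. toy_vec is a hypothetical 
name and the chunking rule is an assumption; the serial lapply is a placeholder 
for wherever a real back-end would evaluate chunks in parallel.

```r
## Toy bpvec-like function; NOT the BiocParallel implementation.
toy_vec <- function(X, FUN, ..., workers = 2L) {
    n <- length(X)                      # contract: length()
    if (n == 0L)
        return(FUN(X, ...))
    k <- min(workers, n)
    ## split the indices 1..n into k contiguous chunks
    idx <- split(seq_len(n), ceiling(seq_len(n) * k / n))
    ## contract: '[' extracts each chunk; a back-end would run these in parallel
    pieces <- lapply(idx, function(i) FUN(X[i], ...))
    ## contract: c() combines the per-chunk results
    do.call(c, unname(pieces))
}

res <- toy_vec(1:10, sqrt, workers = 3L)
```

Because FUN sees whole chunks rather than single elements, this only gives the 
right answer when FUN is vectorized, i.e. when applying it to pieces and 
concatenating equals applying it to the whole.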


bpvectorize is a simple way to convert a 'vectorized' function into a parallel, 
vectorized version, e.g., pcountOverlaps = bpvectorize(countOverlaps).
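
One way to picture this is as a closure that captures the original function and 
forwards its first argument to a parallel vector evaluator. The sketch below 
uses parallel::pvec from base R for that role; toy_vectorize and its cores 
argument are hypothetical, and this is not how bpvectorize itself is written.

```r
## Toy analogue of bpvectorize; NOT BiocParallel code.
toy_vectorize <- function(FUN, cores = 1L) {
    force(FUN)
    force(cores)
    ## the returned function has the same calling convention as FUN,
    ## but splits its first argument across 'cores' processes
    function(v, ...) parallel::pvec(v, FUN, ..., mc.cores = cores)
}

## cores = 1L keeps the example portable; pvec forks only when cores > 1
psqrt <- toy_vectorize(sqrt)
res <- psqrt(1:10)
```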


I'm happy to hear of major mis-steps, and areas in pressing need of development, 
either on or off list or via the github interface.


Ryan Thompson has made valuable contributions, especially DoparParam and 
cleaning up bpvec and bplapply; I haven't always managed to wrangle git and svn 
(thanks Laurent for the --add-author-name tip, which works when I do other 
things right) in a way that fully credits his contribution, for which I apologize.

Martin
-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


