[Bioc-devel] BiocParallel -- update
Martin Morgan
mtmorgan at fhcrc.org
Tue Dec 4 02:32:52 CET 2012
Bioc Developers --
BiocParallel generated quite a bit of discussion, so I'm providing a brief
update. Version 0.0.5 is available to R-devel users via biocLite; it's in svn
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/BiocParallel
and github
https://github.com/Bioconductor/BiocParallel
We have tried to incorporate some key ideas, though things are far from complete.
The basic idea is that one creates a 'param'
p = MulticoreParam(workers=8)
and uses that in computations
bplapply(1:8, function(i) Sys.sleep(1), param=p)
There is a simple registry, populated at start-up with a 'greedy' param instance
(e.g., MulticoreParam(workers=parallel::detectCores())); a param can also be
registered explicitly
register(p)
The 'default' param (the most recently registered one, with the default=TRUE
argument) is used if param is missing
bplapply(1:8, function(i) Sys.sleep(1))
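
The registry behaviour described above can be sketched in plain R. This is only an illustration under stated assumptions, not BiocParallel's implementation: register_sketch, registered_default and lapply_sketch are hypothetical names, a param is modelled as a bare list with a 'workers' element, and parallel::mclapply stands in for the real back-end.

```r
# Hypothetical registry sketch; none of these names come from BiocParallel.
.registry <- new.env(parent = emptyenv())
.registry$params <- list()

register_sketch <- function(param, default = TRUE) {
    # A param registered with default = TRUE goes to the front of the list,
    # so the most recent such registration becomes the default.
    if (default) {
        .registry$params <- c(list(param), .registry$params)
    } else {
        .registry$params <- c(.registry$params, list(param))
    }
    invisible(param)
}

registered_default <- function() {
    if (length(.registry$params) == 0L)
        stop("no param registered")
    .registry$params[[1L]]
}

lapply_sketch <- function(X, FUN, ..., param) {
    # Fall back to the registered default when 'param' is missing.
    if (missing(param))
        param <- registered_default()
    parallel::mclapply(X, FUN, ..., mc.cores = param$workers)
}

# mclapply forks only on Unix-alikes; on Windows it requires mc.cores = 1.
workers <- if (.Platform$OS.type == "windows") 1L else 2L
register_sketch(list(workers = workers))
res <- lapply_sketch(1:4, function(i) i^2)   # no 'param' supplied
```

Keeping the default at the front of a list makes "most recently registered wins" a one-line lookup, at the cost of never removing older entries; a real registry would presumably need a way to do that.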
There are MulticoreParam, SnowParam, and DoparParam params so far; SnowParam is
'lazy' and bpstart / bpstop can be used to start and stop the implied cluster
> p = SnowParam(workers=2)
> p = bpstart(p)
Bioconductor version 2.12 (BiocInstaller 1.9.5), ?biocLite for help
Bioconductor version 2.12 (BiocInstaller 1.9.5), ?biocLite for help
> p = bpstop(p)
DoparParam (currently) indicates that a foreach-style back-end has been
registered (via standard foreach approaches), and bplapply(1:8, ...,
param=DoparParam()) uses foreach for evaluation. *Param are S4 classes (should
probably be reference classes) that extend BiocParallelParam and so anyone can
implement a new *Param; eventually BiocParallelParam will define 'required'
fields (like 'workers' and 'setSeed') that all *Param objects are expected to
support.
bplapply has signature bplapply(X, FUN, ..., param) and is a generic in all
three arguments, so again package developers can implement versions tailored to
their clusters (Florian has sent me some code for an SGE scheduler, which I have
not yet incorporated).
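
The extension mechanism described above (a new *Param class plus a method on the generic) can be sketched with the methods package alone. The class and generic names below (ParamSketch, SerialParamSketch, lapply_sketch) are hypothetical stand-ins so that the example runs without BiocParallel installed; a real extension would extend BiocParallelParam and define a bplapply method instead.

```r
library(methods)

# Stand-in for the package's virtual base class (an assumption, not the real one).
setClass("ParamSketch", representation("VIRTUAL", workers = "numeric"))

# A hypothetical back-end that simply evaluates serially.
setClass("SerialParamSketch", contains = "ParamSketch")

# A generic that dispatches on X, FUN and param, mirroring the
# bplapply(X, FUN, ..., param) signature described in the text.
setGeneric("lapply_sketch",
    function(X, FUN, ..., param) standardGeneric("lapply_sketch"),
    signature = c("X", "FUN", "param"))

# The method is selected by the class of 'param'; X and FUN are left as ANY.
setMethod("lapply_sketch", c("ANY", "ANY", "SerialParamSketch"),
    function(X, FUN, ..., param) lapply(X, FUN, ...))

p <- new("SerialParamSketch", workers = 1)
res <- lapply_sketch(1:3, function(i) i * 2, param = p)
```

A cluster-specific back-end would follow the same pattern: one class carrying whatever the scheduler needs, and one method that knows how to submit the work.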
Only bplapply and bpvec are currently implemented as 'algorithms'. They have a
common signature and have been implemented to rely only on length, '[', '[['
(for bplapply) and 'c' (for bpvec); this is the 'contract' that we'll try to
maintain. We'd like to implement other algorithms, and to make current
algorithms more useful by including better error handling, scheduling, and
reduction.
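
To make the 'contract' concrete, here is a minimal sketch of two algorithms that touch their input only through the operations named above. The function names and the 'workers' and 'apply_fun' arguments are hypothetical; apply_fun defaults to lapply as a serial stand-in, and something like parallel::mclapply could be passed in its place. This is not the package's code.

```r
lapply_contract <- function(X, FUN, ..., apply_fun = lapply) {
    # Uses only length() and `[[` on X.
    apply_fun(seq_len(length(X)), function(i) FUN(X[[i]], ...))
}

vec_contract <- function(X, FUN, ..., workers = 2L, apply_fun = lapply) {
    # Uses only length() and `[` on X, and c() on the results.
    n <- length(X)
    if (n == 0L)
        return(FUN(X, ...))
    k <- min(workers, n)
    # Split the indices 1..n into k contiguous chunks.
    idx <- split(seq_len(n), ceiling(seq_len(n) * k / n))
    pieces <- apply_fun(idx, function(i) FUN(X[i], ...))
    # unname() keeps the chunk labels out of the names of the result.
    do.call(c, unname(pieces))
}

squares <- lapply_contract(list(1, 2, 3), function(x) x^2)
roots <- vec_contract(c(1, 4, 9, 16, 25), sqrt)
```

Any class that provides length, '[', '[[' and 'c' methods should work with functions written this way, which is the point of keeping the contract small.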
bpvectorize is a simple way to convert a 'vectorized' function into a parallel,
vectorized version, e.g., pcountOverlaps = bpvectorize(countOverlaps).
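
The idea can be sketched as a function that returns a function. The names below are hypothetical, and the sketch assumes that only the first argument is split across workers while any further arguments are passed whole to every chunk; whether bpvectorize behaves exactly this way is not stated here. lapply is again a serial stand-in for a parallel evaluator.

```r
vectorize_sketch <- function(FUN, workers = 2L, apply_fun = lapply) {
    function(X, ...) {
        n <- length(X)
        if (n == 0L)
            return(FUN(X, ...))
        k <- min(workers, n)
        # Contiguous chunks of the first argument; '...' goes to every chunk.
        idx <- split(seq_len(n), ceiling(seq_len(n) * k / n))
        do.call(c, unname(apply_fun(idx, function(i) FUN(X[i], ...))))
    }
}

# A vectorized function of two arguments, split over the first one only.
is_member <- function(x, table) x %in% table
p_is_member <- vectorize_sketch(is_member)
hits <- p_is_member(1:6, table = c(2, 4))
```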
I'm happy to hear of major mis-steps, and areas in pressing need of development,
either on or off list or via the github interface.
Ryan Thompson has made valuable contributions, especially DoparParam and
cleaning up bpvec and bplapply; I haven't always managed to wrangle git and svn
(thanks Laurent for the --add-author-name tip, which works when I do other
things right) in a way that fully credits his contribution, for which I apologize.
Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793