[Bioc-devel] BiocParallel -- update
Ryan C. Thompson
rct at thompsonclan.org
Tue Dec 4 22:44:23 CET 2012
By the way, all my work on BiocParallel is going to end up here:
If you want to read through the multicore-only pvectorize, it is here:
It's a little more than one line of code now. A lot of the code deals
with proper recycling in the case of multiple vectorized args and
merging the signature of the function with that of pvec, as well as
corner cases like not vectorizing anything and being passed length-1
There's also an mcmapply function that is to mapply as mclapply is to
lapply. I plan to implement a param-generic version called bpmapply,
which may become the backend for bpvectorize.
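
To make the recycling point concrete, here is a minimal sketch of applying a vectorized function over several arguments in chunks. It is not the actual pvectorize code: the name `pvec_multi` and its arguments `n.chunks` and `mc.cores` are hypothetical, and the signature merging is left out entirely. It only assumes `mclapply` and `splitIndices` from the parallel package.

```r
library(parallel)

# Hypothetical helper, not the BiocParallel implementation: apply a
# vectorized function `f` over chunks of several recycled arguments.
pvec_multi <- function(f, ..., n.chunks = 2L, mc.cores = 1L) {
  args <- list(...)
  lens <- lengths(args)
  # Corner case: nothing to vectorize over, so just call f directly.
  if (length(args) == 0L || any(lens == 0L)) return(f(...))
  n <- max(lens)
  # Recycle every argument to the common length *before* chunking, so
  # that each chunk sees the same pairing of elements as the serial call.
  args <- lapply(args, rep_len, length.out = n)
  # Split the index range into at most n.chunks contiguous pieces.
  idx <- splitIndices(n, min(n.chunks, n))
  pieces <- mclapply(idx, function(i) {
    do.call(f, lapply(args, `[`, i))
  }, mc.cores = mc.cores)
  # Concatenate the per-chunk results in their original order.
  do.call(c, unname(pieces))
}

# Length-6 and length-2 arguments: the shorter one is recycled.
pvec_multi(`+`, 1:6, c(10, 20), n.chunks = 3L)
```

Recycling first and chunking second is the simple option. It costs some memory for long inputs, but it keeps chunk boundaries from changing which elements are paired. `mc.cores` defaults to 1 here only so the sketch also runs where forking is unavailable.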
On Tue 04 Dec 2012 01:15:24 PM PST, Michael Lawrence wrote:
> On Tue, Dec 4, 2012 at 12:47 PM, Ryan C. Thompson
> <rct at thompsonclan.org> wrote:
> One issue that I see is that for some kinds of parallel backends,
> there may not be any way for "bpworkers" to return something
> meaningful. For example, a backend that submits jobs to a large
> cluster may not know exactly how many nodes are in the cluster,
> and in any case returning the total number of nodes may not be
> appropriate, since those nodes are shared with other cluster
> users. This is primarily important for the pvec function, which
> uses the result of bpworkers to decide how many chunks to split
> the input into.
> I guess one solution is to make sure that for any backend that
> cannot natively determine a number of available workers, we
> require the number of workers as an argument when creating the
> param object for that backend. e.g.:
> param <- IndeterminateSizedClusterParam(workers=50).
> I think this is on the right track. Since the nature of the request
> affects how the jobs are scheduled (earlier or later), there's no way
> to automatically make the decision, even if we could detect the total
> cluster size. As I noted in the previous email, having a consistent
> means of specifying resource requests across backends would be helpful.
> I could see an API like:
> request <- ResourceRequest(num.cores = 5)
> cluster <- LSFCluster(request) # or MulticoreCluster(request)
> pvec(v, cluster = cluster)
> Depending on the cluster, the 'cluster' object could be queried for
> whether the requested resources are currently available (or the jobs
> will need to wait). A default cluster object could be registered in
> options(). The Cluster constructors could take the arguments of
> ResourceRequest directly for simple tasks.
> Then the question is whether pvec returns the result of evaluation, or
> the promise of evaluation. Probably best to have pvec always behave
> synchronously, then have variants like apvec() for asynchronous
> execution. The promise would be backend-specific and support status
> queries. For multicore, this is basically mcparallel/mccollect.
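
For the multicore case, the synchronous/asynchronous split described above could look roughly like the sketch below. The names `apvec_sketch` and `pvec_sync_sketch` are hypothetical and nothing here is a proposed API. It relies only on `mcparallel` and `mccollect` from the parallel package, so it assumes a Unix-alike where forking is available, and it runs the whole call in a single child rather than splitting it into chunks.

```r
library(parallel)

# Hypothetical asynchronous variant: start the evaluation in a forked
# child process and return immediately with a job handle (the "promise").
apvec_sketch <- function(v, FUN, ...) {
  mcparallel(FUN(v, ...))
}

# Hypothetical synchronous variant built on top of it: block until the
# child has finished and return its value.
pvec_sync_sketch <- function(v, FUN, ...) {
  job <- apvec_sketch(v, FUN, ...)
  mccollect(job)[[1]]
}

job <- apvec_sketch(1:10, sqrt)   # returns at once
# ... other work could happen here ...
res <- mccollect(job)[[1]]        # wait for and fetch the result
```

A status query on the promise would correspond to calling `mccollect(job, wait = FALSE)`, which returns NULL while the child is still running.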
> Additionally, as discussed previously, it makes sense to be able
> to explicitly choose a chunk size or number of chunks for pvec,
> rather than splitting into exactly as many chunks as there are
> parallel workers. I implemented this in the non-generic
> multicore-only version of pvec, but I still need to port it to the
> generic version that works for any param. Do people think that the
> chunk options should be included in the MulticoreParam class, or
> specified when pvec is called?
> What about supporting both? If passed directly to pvec, the params
> option is overridden.
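
Supporting both could be as simple as the precedence rule sketched below: an explicit argument to pvec wins, then the setting stored in the param, then the number of workers. Everything here is hypothetical. `SketchParam`, `resolve_chunks` and `pvec_sketch` are illustration names, a plain list stands in for the param class, and `lapply` stands in for whatever parallel apply the backend provides.

```r
# Hypothetical param constructor: a plain list standing in for a
# param object such as MulticoreParam.
SketchParam <- function(workers = 2L, n.chunks = NULL) {
  list(workers = workers, n.chunks = n.chunks)
}

# Resolve the number of chunks:
# call argument > param setting > number of workers.
resolve_chunks <- function(param, n.chunks = NULL) {
  if (!is.null(n.chunks)) return(n.chunks)
  if (!is.null(param$n.chunks)) return(param$n.chunks)
  param$workers
}

pvec_sketch <- function(v, FUN, param, n.chunks = NULL) {
  # Never ask for more chunks than there are elements.
  k <- min(resolve_chunks(param, n.chunks), length(v))
  if (k < 1L) return(FUN(v))
  idx <- parallel::splitIndices(length(v), k)
  # lapply stands in for the backend's parallel apply.
  do.call(c, lapply(idx, function(i) FUN(v[i])))
}

p <- SketchParam(workers = 4L, n.chunks = 8L)
resolve_chunks(p)        # uses the param's setting
resolve_chunks(p, 3L)    # the call argument overrides it
```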
> I have also written a non-generic multicore-only version of
> pvectorize that allows for multiple vectorized arguments instead
> of just one, and furthermore gives the parallelized function an
> identical signature to the original function. Again, this needs to
> be ported to the generic bpvectorize.