[Bioc-devel] BiocParallel

Tue Dec 4 05:46:50 CET 2012

Picking up this thread for lack of a better place (where should
BiocParallel be discussed?)

I saw Martin's updates to BiocParallel; great.  Florian's SGE
scheduler was also mentioned; is that one built on top of BatchJobs?
If so, I'd be interested in looking into generalizing it to work
with any BatchJobs scheduler.

I believe there is going to be a new release of BatchJobs rather soon,
so it's probably worth waiting until that is available.

The main use case I'm interested in is to launch batch jobs on a
PBS/Torque cluster, and then use multicore processing on each compute
node.  It would be nice to be able to do this using the BiocParallel
model, but maybe it is too optimistic to expect everything to work
under the same model.  Also, as Vince hinted, fault tolerance etc.
needs to be addressed, and addressed differently in the different
setups.
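
To make the two-level idea concrete, here is a rough sketch using only
the base parallel package.  The outer lapply() is merely a stand-in for
submitting one batch job per chunk to the scheduler, and process_item(),
process_chunk(), and the chunk layout are made-up names and values for
illustration; nothing here is BiocParallel or BatchJobs API.

```r
library(parallel)

# Hypothetical per-item work function (assumption for illustration).
process_item <- function(x) x^2

# Inner level: runs on one compute node and spreads the items of a chunk
# over local cores. mclapply() relies on forking, so fall back to a
# single core on Windows.
process_chunk <- function(chunk, cores = 2L) {
  if (.Platform$OS.type == "windows") cores <- 1L
  mclapply(chunk, process_item, mc.cores = cores)
}

# Outer level: one chunk per batch job. Plain lapply() is only a
# placeholder for whatever would submit these chunks to the cluster.
items <- 1:12
chunks <- split(items, rep(1:3, each = 4))
per_chunk <- lapply(chunks, process_chunk)

# Flatten the nested per-chunk lists back into one result vector.
results <- unlist(per_chunk, use.names = FALSE)
```

Whether the outer and inner levels can sit behind one interface, with
sensible error handling at each level, is the open question.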

/Henrik

On Tue, Nov 20, 2012 at 6:59 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com> wrote:
>
>
>
> On Sat, 17 Nov 2012 13:05:29 -0800,"Ryan C. Thompson" <rct at thompsonclan.org> wrote:
>
>> On 11/17/2012 02:39 AM, Ramon Diaz-Uriarte wrote:
>> > In addition to Steve's comment, is it really a good thing that "all code
>> > stays the same."?  I mean, multiple machines vs. multiple cores are,
>> > often, _very_ different things: for instance, shared vs. distributed
>> > memory, communication overhead differences, whether or not you can assume
>> > packages and objects to be automagically present in the slaves/child
>> > process, etc. So, given they are different situations, I think it
>> > sometimes makes sense to want to write different code for each situation
>> > (I often do); not to mention Steve's hybrid cases ;-).
>> >
>> >
>> > Since BiocParallel seems to be a major undertaking, maybe it would be
>> > appropriate to provide a flexible approach, instead of hard wiring the
>> > foreach approach.
>> Of course there are cases where the same code simply can't work for both
>> multicore and multi-machine situations, but those generally don't fall
>> into the category of things that can be done using lapply. Lapply and
>> all of its parallelized buddies like mclapply, parLapply, and foreach
>> are designed for data-parallel operations with no interdependence
>> between results, and these kinds of operations generally parallelize as
>> well across machines as across cores, unless your network is not fast
>> enough (in which case you would choose not to use multi-machine
>> parallelism). If you want a parallel algorithm for something like the
>> disjoin method of GRanges, you might need to write some special purpose
>> code, and that code might be very different for multicore vs multi-machine.
>
>> So yes, sometimes there is a fundamental reason that you have to change
>> the code to make it run on multiple machines, and neither foreach nor
>> any other parallelization framework will save you from having to rewrite
>> your code. But often there is no fundamental reason that the code has to
>> change, but you end up changing it anyway because of limitations in your
>> parallelization framework. This is the case that foreach saves you from.
>
>
>
> Hummm... I guess you are right, and we are talking about "often" or "most
> of the time", which is where all this would fit. Point taken.
>
>
> Best,
>
> R.
>
> --
> Ramon Diaz-Uriarte
> Department of Biochemistry, Lab B-25
> Facultad de Medicina
> Universidad Autónoma de Madrid
> Arzobispo Morcillo, 4
> 28029 Madrid
> Spain
>
> Phone: +34-91-497-2412
>
> Email: rdiaz02 at gmail.com
>        ramon.diaz at iib.uam.es
>
> http://ligarto.org/rdiaz
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel