[Bioc-devel] BiocParallel

Henrik Bengtsson hb at biostat.ucsf.edu
Tue Dec 4 21:38:04 CET 2012


Thanks.

On Tue, Dec 4, 2012 at 3:47 AM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
> I have been booked up, so no chance to deploy, but I do have access to SGE
> and LSF, so I will try and report ASAP.

...and I'll try it out on PBS (... but I most likely won't have time
to do this until the end of the year).

Henrik

>
>
> On Tue, Dec 4, 2012 at 4:08 AM, Hahne, Florian <florian.hahne at novartis.com>
> wrote:
>>
>> Hi Henrik,
>> I have now come up with a relatively generic version of this
>> SGEcluster approach. It does indeed use BatchJobs under the hood and
>> should thus support all available cluster queues, assuming that the
>> necessary BatchJobs routines are available. I could only test this on our
>> SGE cluster, but Vince wanted to try other queuing systems. Not sure how
>> far he got. For now the code is wrapped in a little package called
>> Qcluster with some documentation. If you want, I can send you a version
>> in a separate mail. It would be good to test this on other systems, and I
>> am sure there remain some bugs that need to be ironed out. In particular,
>> the fault tolerance you mentioned needs to be addressed properly.
>> Currently the code may leave unwanted garbage if things fail in the wrong
>> places, because all the communication is file-based.
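>>
>> (For reference, the per-queue piece that BatchJobs needs is just a
>> cluster-functions entry in its configuration. A minimal sketch of a
>> .BatchJobs.R for SGE; the template file name "sge.tmpl" is only a
>> placeholder:
>>
>>   # .BatchJobs.R, read by BatchJobs on startup
>>   cluster.functions <- makeClusterFunctionsSGE("sge.tmpl")
>>
>> Swapping in makeClusterFunctionsTorque() or makeClusterFunctionsLSF()
>> should in principle be all that other queuing systems require.)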
>> Martin, I'll send you my updated version in case you want to include this
>> in BiocParallel for others to contribute.
>> Florian
>>
>> On 12/4/12 5:46 AM, "Henrik Bengtsson" <hb at biostat.ucsf.edu> wrote:
>>
>> >Picking up this thread for lack of other places (= where should
>> >BiocParallel be discussed?)
>> >
>> >I saw Martin's updates on BiocParallel - great.  Florian's SGE
>> >scheduler was also mentioned; is that one built on top of BatchJobs?
>> >If so, I'd be interested in looking into that and generalizing it to
>> >work with any BatchJobs scheduler.
>> >
>> >I believe there is going to be a new release of BatchJobs rather soon,
>> >so it's probably worth waiting until that is available.
>> >
>> >The main use case I'm interested in is launching batch jobs on a
>> >PBS/Torque cluster and then using multicore processing on each compute
>> >node.  It would be nice to be able to do this using the BiocParallel
>> >model, but maybe it is too optimistic to get everything to work under
>> >the same model.  Also, as Vince hinted, fault tolerance etc. needs to
>> >be addressed, and addressed differently in the different setups.
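>> >
>> >For concreteness, a rough sketch of that use case on top of BatchJobs
>> >(untested; the Torque template "torque.tmpl" and the chunk sizes are
>> >just placeholders):
>> >
>> >  library(BatchJobs)
>> >  # In .BatchJobs.R:
>> >  #   cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")
>> >  reg <- makeRegistry(id = "pbsMulticore")
>> >  chunks <- split(1:100, rep(1:10, each = 10))
>> >  # One Torque job per chunk; each job then fans out over the cores
>> >  # of its compute node with mclapply().
>> >  batchMap(reg, function(idx) {
>> >    parallel::mclapply(idx, function(i) sqrt(i), mc.cores = 4)
>> >  }, chunks)
>> >  submitJobs(reg)
>> >  waitForJobs(reg)
>> >  res <- loadResults(reg)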
>> >
>> >/Henrik
>> >
>> >On Tue, Nov 20, 2012 at 6:59 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>
>> >wrote:
>> >>
>> >> On Sat, 17 Nov 2012 13:05:29 -0800, "Ryan C. Thompson"
>> >> <rct at thompsonclan.org> wrote:
>> >>
>> >>> On 11/17/2012 02:39 AM, Ramon Diaz-Uriarte wrote:
>> >>> > In addition to Steve's comment, is it really a good thing that
>> >>> > "all code stays the same"?  I mean, multiple machines vs. multiple
>> >>> > cores are, often, _very_ different things: for instance, shared vs.
>> >>> > distributed memory, communication overhead differences, whether or
>> >>> > not you can assume packages and objects to be automagically present
>> >>> > in the slaves/child process, etc.  So, given they are different
>> >>> > situations, I think it sometimes makes sense to want to write
>> >>> > different code for each situation (I often do); not to mention
>> >>> > Steve's hybrid cases ;-).
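>> >>> >
>> >>> > To make the "automagically present" point concrete: with forked
>> >>> > workers a global object is simply inherited, while on a socket
>> >>> > cluster it must be exported by hand. A small untested sketch:
>> >>> >
>> >>> >   library(parallel)
>> >>> >   x <- 10
>> >>> >   # Forked children inherit the parent workspace, so x just works:
>> >>> >   mclapply(1:3, function(i) i + x)
>> >>> >   # Socket workers start as fresh R sessions, so x must be shipped:
>> >>> >   cl <- makeCluster(3)
>> >>> >   clusterExport(cl, "x")
>> >>> >   parLapply(cl, 1:3, function(i) i + x)
>> >>> >   stopCluster(cl)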
>> >>> >
>> >>> >
>> >>> > Since BiocParallel seems to be a major undertaking, maybe it would
>> >>> > be appropriate to provide a flexible approach, instead of
>> >>> > hard-wiring the foreach approach.
>> >>> Of course there are cases where the same code simply can't work for
>> >>> both multicore and multi-machine situations, but those generally
>> >>> don't fall into the category of things that can be done using
>> >>> lapply().  lapply() and all of its parallelized buddies like
>> >>> mclapply(), parLapply(), and foreach are designed for data-parallel
>> >>> operations with no interdependence between results, and these kinds
>> >>> of operations generally parallelize as well across machines as
>> >>> across cores, unless your network is not fast enough (in which case
>> >>> you would choose not to use multi-machine parallelism).  If you want
>> >>> a parallel algorithm for something like the disjoin() method of
>> >>> GRanges, you might need to write some special-purpose code, and that
>> >>> code might be very different for multicore vs. multi-machine.
>> >>
>> >>> So yes, sometimes there is a fundamental reason that you have to
>> >>> change the code to make it run on multiple machines, and neither
>> >>> foreach nor any other parallelization framework will save you from
>> >>> having to rewrite it.  Often, though, there is no fundamental reason
>> >>> for the code to change, and you end up changing it anyway because of
>> >>> limitations in your parallelization framework.  That is the case
>> >>> foreach saves you from.
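>> >>>
>> >>> To be concrete, with foreach the loop body stays backend-agnostic.
>> >>> A minimal sketch, assuming the doMC and doParallel backend packages
>> >>> are installed:
>> >>>
>> >>>   library(foreach)
>> >>>   # The worker code is identical under both backends:
>> >>>   run <- function() foreach(i = 1:8, .combine = c) %dopar% sqrt(i)
>> >>>
>> >>>   library(doMC)
>> >>>   registerDoMC(cores = 4)        # multicore backend
>> >>>   run()
>> >>>
>> >>>   library(doParallel)
>> >>>   cl <- parallel::makeCluster(4)
>> >>>   registerDoParallel(cl)         # cluster backend
>> >>>   run()
>> >>>   parallel::stopCluster(cl)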
>> >>
>> >>
>> >>
>> >> Hummm... I guess you are right, and we are talking about "often" or
>> >> "most of the time", which is where all this would fit. Point taken.
>> >>
>> >>
>> >> Best,
>> >>
>> >> R.
>> >>
>> >> --
>> >> Ramon Diaz-Uriarte
>> >> Department of Biochemistry, Lab B-25
>> >> Facultad de Medicina
>> >> Universidad Autónoma de Madrid
>> >> Arzobispo Morcillo, 4
>> >> 28029 Madrid
>> >> Spain
>> >>
>> >> Phone: +34-91-497-2412
>> >>
>> >> Email: rdiaz02 at gmail.com
>> >>        ramon.diaz at iib.uam.es
>> >>
>> >> http://ligarto.org/rdiaz
>> >>


