[Bioc-devel] parallel package generics

Hahne, Florian florian.hahne at novartis.com
Thu Oct 25 12:17:39 CEST 2012


On 10/24/12 6:07 PM, "Cook, Malcolm" <MEC at stowers.org> wrote:

>On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com>
>wrote:
>
>>I agree that it would be fruitful to have parLapply in BiocGenerics. It
>>looks to be a flexible abstraction and its presence in the parallel
>>package makes it ubiquitous. If it hasn't been done already, mclapply (and
>>mcmapply) would be good candidates, as well. The fork-based parallelism is
>>substantively different in terms of the API from the more general
>>parallelism of parLapply.
>>
>>Someone was working on some more robust and convenient wrappers around
>>mclapply. Did that ever see the light of day?
>
>
>If you are referring to
>http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>
>in which I had offered some small changes to parallel::pvec
>
>	https://gist.github.com/3757873/
>
>and after which Martin had provided me with a route I have not (yet?)
>followed toward submitting a patch to R for consideration by R-devel /
>Simon Urbanek in
>
>http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>
>
>
>
>>>>On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou
>>>><mailinglist.honeypot at gmail.com> wrote:
>>>>
>>>>> In response to a question from yesterday, I pointed someone to the
>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>>>>> necessarily be "buried" in the ShortRead package (aside from it
>>>>> having a `sr` prefix).
>>>>>
>>>>
>>> I don't know that srapply necessarily 'got it right'...
>
>
>One thing I like about srapply is its support for a reduce argument.
>
>>>>>I had thought it might be a good idea to move that (or something like
>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
>>>>> but also realized that it would add more dependencies where someone
>>>>> might not necessarily need them.
>
>
>>>>>
>>>>> But, almost surely, a large majority of the people will be happy to do
>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
>>>>> to add -- on the other hand, this large majority is probably enriched
>>>>> for people who are doing NGS analysis, in which case, keeping it in
>>>>> ShortRead can make some sense.
>
>I remain confused about the need for putting any of this into BiocGenerics
>at all.  It seems to me that properly construed parallelization primitives
>ought to 'just work' with any object which supports indexing and length.

That may well be true, but currently we have a whole lot of legacy code
where folks opted to use one of the snow-family functions as the one and
only mode for parallel processing. My argument here is simply that we
could make all of these implementations much more flexible by allowing for
arbitrary cluster objects in there, at very little cost for the package
maintainers. Obviously we would ideally have a proper parallel abstraction
layer that everybody agrees to use, but currently we are still far, far
away from that. Plus the problem of legacy code remains.
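
To make the argument concrete, here is a minimal sketch of the idea and
not code from any existing package. It assumes a generic named parLapply
that dispatches on its first argument, a default method that hands
ordinary cluster objects on to parallel::parLapply, and a hypothetical
ToyCluster class standing in for a scheduler-backed cluster. The
squareAll function plays the role of legacy code that only knows it was
given "a cluster".

```r
library(methods)

# Hypothetical generic with the same name as parallel::parLapply,
# dispatching on the cluster argument only.
setGeneric("parLapply",
           function(cl, X, fun, ...) standardGeneric("parLapply"),
           signature = "cl")

# Default method: anything that is not a special cluster class (including
# the S3 cluster objects from parallel::makeCluster) is passed on unchanged.
setMethod("parLapply", "ANY",
          function(cl, X, fun, ...) parallel::parLapply(cl, X, fun, ...))

# Toy stand-in for a scheduler-backed cluster (hypothetical class and slot).
setClass("ToyCluster", representation(queue = "character"))

setMethod("parLapply", "ToyCluster",
          function(cl, X, fun, ...) {
              # A real method would submit jobs, poll their status and
              # collect the results; this sketch just evaluates serially.
              lapply(X, fun, ...)
          })

# Legacy-style function: it calls parLapply on whatever cluster it is given.
squareAll <- function(x, cluster) parLapply(cluster, x, function(i) i^2)

res <- squareAll(1:3, new("ToyCluster", queue = "short"))
```

The legacy function itself does not change; it only has to see the
generic instead of the plain function, which is the reason for wanting
the generic in a place that everybody imports.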

>
>I would appreciate hearing arguments to the contrary.
>
>Florian, in a similar vein, could we not seek to change
>parallel::makeCluster to be extensible to, say, support SGE cluster?  This
>seems like the 'right thing to do'.  ???

Not necessarily. I can't see how this could easily be achieved without
knowing all the possible cluster subtypes in advance. We could not turn it
into a generic and dispatch to different methods because the signatures
are not necessarily different. It seems to me that the paradigm, at least
in Bioconductor, for these cases is to have a class that encapsulates all
the necessary settings, and a constructor function for this class. Again,
if we had something more generic in place we certainly would not need all
of this.
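
A minimal sketch of that class-plus-constructor paradigm follows. The
class name SGEcluster is the one used earlier in this thread; the slots
(queue, nodes, memory) and their defaults are hypothetical and only
illustrate where the settings would live.

```r
library(methods)

# Class holding the settings for one cluster subtype. The slot names are
# assumptions made for this illustration.
setClass("SGEcluster",
         representation(queue = "character",
                        nodes = "integer",
                        memory = "character"))

# Constructor function: the single place where defaults and basic checks
# are applied, so callers never have to call new() directly.
SGEcluster <- function(queue = "all.q", nodes = 4L, memory = "2G") {
    nodes <- as.integer(nodes)
    stopifnot(length(nodes) == 1L, !is.na(nodes), nodes > 0L)
    new("SGEcluster", queue = queue, nodes = nodes, memory = memory)
}

cl <- SGEcluster(queue = "long.q", nodes = 8)
```

Each cluster subtype brings its own constructor, so nothing has to know
all the possible subtypes in advance.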

>
>
>Regardless, I think we have raised some considerations that might inform
>improvements to parallel, including points about error handling, reducing
>results, block-level parallelization over List/Vector (in addition to
>vector), etc.
>
>I think perhaps having a google doc that we can collectively edit to
>contain the requirements we are trying to achieve might move us forward
>effectively. Would this help? Or perhaps a page under
>http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>
>
>>>>> Taking one step back, I recall some chatter last week (or two) about
>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
>>>>> and there being ideas to wrap different types of ||-ization behind an
>>>>> easy to use interface (I think this was the convo), and then I took a
>>>>> further step back and often wonder why we just don't bite the bullet
>>>>> and take advantage of the `foreach` infrastructure that is already out
>>>>> there -- in which case, I could imagine a "doSGE" package that might
>>>>> handle the particulars of what Florian is referring to. You could then
>>>>> configure it externally via some `registerDoSGE(some.config.object)`
>>>>> and just have the package code happily run it through `foreach(...)
>>>>> %dopar%` and be done w/ it.
>>>>>
>>>>>
>>>> IMHO it is relevant.  I have not looked for other abstractions, and
>>>> this one seems to work.  Florian's objectives might be a good test case
>>>> for adequacy.
>>>>
>>>
>>> The registerDoDah does seem to be a useful abstraction.
>
>Is this not more-or-less the intention of parallel::setDefaultCluster?
>
>--Malcolm
>
>
>
>>>
>>> I think there's a lot of work to do for some sort of coordinated
>>> parallelization that putting parLapply into BiocGenerics might encourage;
>>> not good things will happen when everyone in a call stack tries to
>>> parallelize independently. But I'm in favor of parLapply in BiocGenerics
>>> at least for the moment.
>>>
>>> Martin
>>>
>>>
>>>
>>>>
>>>>  ... at least, I thought this is what was being talked about here (and
>>>>> popped up a week or two ago) -- sorry if I completely missed the mark
>>>>> ...
>>>>>
>>>>> -steve
>>>>>
>>>>>
>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>>>>> <florian.hahne at novartis.com> wrote:
>>>>>
>>>>>> Hi Martin,
>>>>>> I could define the generics in my own package, but that would mean
>>>>>> that those will only be available there, or in the global environment
>>>>>> assuming that I also export them, or in all additional packages that
>>>>>> explicitly import them from my name space. Now there already are a
>>>>>> whole bunch of packages around that all allow for parallelization via
>>>>>> a cluster object. Obviously those all import the parLapply function
>>>>>> from the parallel package. That means that I can't simply supply my
>>>>>> own modified cluster object, because the code that calls parLapply
>>>>>> will not know about the generic in my package, even if it is attached.
>>>>>> Ideally parLapply would be a generic function already in the parallel
>>>>>> package. Not sure who needs to be convinced in order for this to
>>>>>> happen, but my gut feeling was that it could be easier to have the
>>>>>> generic in BiocGenerics.
>>>>>> Maybe I am missing something obvious here, but imo there is no way to
>>>>>> overwrite parLapply globally for my own class unless the generic is
>>>>>> imported by everyone who wants to make use of the special method.
>>>>>> Florian
>>>>>>
>>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>>>>>>
>>>>>>  On 10/17/2012 05:45 AM, Hahne, Florian wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I was wondering whether it would be possible to have proper generics
>>>>>>>> for some of the functions in the parallel package, e.g. parLapply and
>>>>>>>> clusterCall. The reason I am asking is because I want to build an S4
>>>>>>>> class that essentially looks like an S3 cluster object but knows how
>>>>>>>> to deal with the SGE. That way I can abstract away all the overhead
>>>>>>>> regarding job submission, job status and reducing the results in the
>>>>>>>> parLapply method of that class, and would be able to supply this new
>>>>>>>> cluster object to all of my existing functions that can be processed
>>>>>>>> in parallel using a cluster object as input. I have played around
>>>>>>>> with the BatchJobs package as an abstraction layer to SGE and that
>>>>>>>> works nicely. As a test case I have created the necessary generics
>>>>>>>> myself in order to supply my own SGEcluster object to a function that
>>>>>>>> normally deals with the "regular" parallel package S3 cluster objects
>>>>>>>> and everything just worked out of the box, but obviously this fails
>>>>>>>> once I am in a name space and my generic is not found anymore. Of
>>>>>>>> course what we would really want is some proper abstraction of
>>>>>>>> parallelization in R, but for now this seems to be at least a cheap
>>>>>>>> compromise. Any thoughts on this?
>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't
>>>>>>> actually send any email!
>>>>>>>
>>>>>>> Is there an obstacle to promoting these to generics in your own
>>>>>>> package? The usual motivation for inclusion in BiocGenerics has been
>>>>>>> to avoid conflicts between packages, but I'm not sure whether this is
>>>>>>> the case (yet)? This would also add a dependency fairly deep in the
>>>>>>> hierarchy.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>  Florian
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>   | Memorial Sloan-Kettering Cancer Center
>>>>>   | Weill Medical College of Cornell University
>>>>> Contact Info:
>>>>> http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>


