[Bioc-devel] parallel package generics

Hahne, Florian florian.hahne at novartis.com
Thu Oct 25 20:12:40 CEST 2012


For me the cleanest option with the least impact would be to have this
fixed directly in the parallel package. However, I think that somebody with
more influence should suggest that to R-devel.
If they will not do it, the other options all seem more or less equivalent
to me.
Florian
-- 






On 10/25/12 12:08 AM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

>Hi,
>
>With Florian's use case, there seems to be a strong/immediate need for
>dispatching on the cluster-like object passed as the 1st argument to
>parLapply() and all the other functions in the parallel package that
>belong to the "snow family" (14 functions in total, all documented in
>?parallel::parLapply). So we've just added those 14 generics to
>BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
>mclapply(), mcmapply(), and pvec()) for now.
>
>Note that the 14 new generics dispatch at least on their 1st argument
>('cl'), but also on their 2nd argument when this argument is 'x', 'X'
>or 'seq' (expected to be a vector-like or matrix-like object). This
>opens the door to defining methods that take advantage of the
>implementation of particular vector-like or matrix-like objects.
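
A minimal sketch of what dispatching on both the cluster and the data argument could look like. Everything here is a stand-in chosen for illustration: `toyParLapply`, `FakeCluster` and `ChunkedVector` are hypothetical names, not the generics or classes in BiocGenerics, and the "parallel" work is just a serial `lapply()`.

```r
library(methods)

# Hypothetical classes: a cluster-like object and a vector-like container.
setClass("FakeCluster", representation(nnodes = "integer"))
setClass("ChunkedVector", representation(chunks = "list"))

# Toy generic that dispatches on its 1st and 2nd arguments ('cl' and 'X').
setGeneric("toyParLapply",
    function(cl = NULL, X, fun, ...) standardGeneric("toyParLapply"),
    signature = c("cl", "X"))

# Fallback for any 'X': plain element-wise lapply().
setMethod("toyParLapply", c("FakeCluster", "ANY"),
    function(cl = NULL, X, fun, ...) lapply(X, fun, ...))

# Specialised method that exploits the container's internal layout by
# processing one chunk at a time, then flattening the per-chunk results.
setMethod("toyParLapply", c("FakeCluster", "ChunkedVector"),
    function(cl = NULL, X, fun, ...) {
        per_chunk <- lapply(X@chunks, function(chunk) lapply(chunk, fun, ...))
        unlist(per_chunk, recursive = FALSE)
    })

cl <- new("FakeCluster", nnodes = 2L)
plain <- toyParLapply(cl, 1:3, function(v) v + 1L)
chunked <- toyParLapply(cl, new("ChunkedVector", chunks = list(1:2, 3:4)),
                        function(v) v + 1L)
```

In a real method the per-chunk step is where each chunk would be shipped to a worker; the point of the sketch is only that the second signature element selects the specialised code path.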
>
>Also note that, even if some of the 14 functions in the "snow family"
>are simple convenience wrappers to other functions in the family, we've
>made all of them generics. For example clusterEvalQ() is a simple
>wrapper to clusterCall():
>
>   > clusterEvalQ
>   function (cl = NULL, expr)
>   clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
>   <environment: namespace:parallel>
>
>And it seems (at least intuitively) that implementing a "clusterCall"
>method for my cluster-like objects should be enough to have
>clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
>this is not the case:
>
>   setClass("FakeCluster", representation(nnodes="integer"))
>
>   setMethod("clusterCall", "FakeCluster",
>       function (cl=NULL, fun, ...) fun(...)
>   )
>
>Then:
>
>   > mycluster <- new("FakeCluster", nnodes=10L)
>   > clusterCall(mycluster, print, 1:6)
>   [1] 1 2 3 4 5 6
>   > clusterEvalQ(mycluster, print(1:6))
>   Error in checkCluster(cl) : not a valid cluster
>
>This is because the "clusterEvalQ" default method is calling
>parallel::clusterCall() (which is *not* the generic), instead of
>calling BiocGenerics::clusterCall() (which *is* the generic).
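
The lookup behaviour described above can be checked directly. This is a sketch under one assumption: the generic is created in the global environment with `setGeneric()` rather than in BiocGenerics, which is enough to show that code inside the parallel namespace still resolves `clusterCall` to the original, non-generic function.

```r
library(methods)
library(parallel)

# Create an S4 generic named clusterCall outside the parallel namespace.
# Here it lands in the global environment; in the thread it lives in
# BiocGenerics.
setGeneric("clusterCall")

# The environment in which the body of parallel::clusterEvalQ looks up
# the functions it calls.
ns <- environment(parallel::clusterEvalQ)
wrapper_env <- environmentName(ns)

# What the wrapper finds when its body calls clusterCall().
seen_by_wrapper <- get("clusterCall", envir = ns)
wrapper_sees_generic <- is(seen_by_wrapper, "standardGeneric")

# What code evaluated at top level finds.
caller_sees_generic <- is(clusterCall, "standardGeneric")
```

The wrapper's enclosing environment is the parallel namespace, so a generic defined anywhere else is never consulted, no matter what is attached on the search path.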
>
>This would be avoided if clusterCall() was a generic defined in
>the parallel package itself (or in a package that parallel depends
>on). And this would of course be a better solution than having those
>generics in BiocGenerics. Is someone willing to bring that case to
>R-devel?
>
>In the meantime I need to define a "clusterEvalQ" method:
>
>   setMethod("clusterEvalQ", "FakeCluster",
>       function (cl=NULL, expr)
>           clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
>   )
>
>And then:
>
>   > clusterEvalQ(mycluster, print(1:6))
>   [1] 1 2 3 4 5 6
>
>Finally note that this method I defined for my objects could be made the
>default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
>could put it in BiocGenerics. Or, since there is apparently nothing to
>gain from making clusterEvalQ() a generic in the first place, we
>could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
>This function would be implemented *exactly* like
>parallel::clusterEvalQ() (and it would mask it), except that now
>it would call BiocGenerics::clusterCall() internally.
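
A sketch of the "ordinary function" option, assuming for illustration that the generic and the masking function are both defined in the global environment instead of in BiocGenerics. The `FakeCluster` class and its method are the ones from the example earlier in this message.

```r
library(methods)
library(parallel)

# Generic created outside the parallel namespace (stand-in for
# BiocGenerics::clusterCall).
setGeneric("clusterCall")

setClass("FakeCluster", representation(nnodes = "integer"))
setMethod("clusterCall", "FakeCluster",
    function(cl = NULL, fun, ...) fun(...))

# Ordinary (non-generic) function that masks parallel::clusterEvalQ.
# Same body as the original, but because it is defined next to the
# generic, its call to clusterCall() goes through S4 dispatch.
clusterEvalQ <- function(cl = NULL, expr)
    clusterCall(cl, eval, substitute(expr), envir = .GlobalEnv)

mycluster <- new("FakeCluster", nnodes = 10L)
res <- clusterEvalQ(mycluster, 1:6)
```

With this arrangement a class only needs a "clusterCall" method for clusterEvalQ() to work on it; no per-class "clusterEvalQ" method is required.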
>
>What should we do?
>
>H.
>
>
>On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
>> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com>
>> wrote:
>>
>>> I agree that it would be fruitful to have parLapply in BiocGenerics. It
>>> looks to be a flexible abstraction and its presence in the parallel
>>> package makes it ubiquitous. If it hasn't been done already, mclapply
>>> (and mcmapply) would be good candidates, as well. The fork-based
>>> parallelism is substantively different in terms of the API from the
>>> more general parallelism of parLapply.
>>>
>>> Someone was working on some more robust and convenient wrappers around
>>> mclapply. Did that ever see the light of day?
>>
>>
>> If you are referring to
>>
>> 	http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>>
>> in which I had offered some small changes to parallel::pvec
>>
>> 	https://gist.github.com/3757873/
>>
>> and after which Martin had provided me with a route I have not (yet?)
>> followed toward submitting a patch to R for consideration by R-devel /
>> Simon Urbanek in
>>
>>
>> 	http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>>
>>
>>
>>
>>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <
>>>>> mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>>> In response to a question from yesterday, I pointed someone to the
>>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>>>>>> necessarily be "buried" in the ShortRead package (aside from it
>>>>>> having a `sr` prefix).
>>>>>>
>>>>>
>>>> I don't know that srapply necessarily 'got it right'...
>>
>>
>> One thing I like about srapply is its support for a reduce argument.
>>
>>>>>> I had thought it might be a good idea to move that (or something
>>>>>> like that) to BiocGenerics (unless implementations aren't allowed
>>>>>> there) but also realized that it would add more dependencies where
>>>>>> someone might not necessarily need them.
>>>>>>
>>>>>> But, almost surely, a large majority of the people will be happy to
>>>>>> do some form of ||-ization, so in my mind it's not such an onerous
>>>>>> thing to add -- on the other hand, this large majority is probably
>>>>>> enriched for people who are doing NGS analysis, in which case,
>>>>>> keeping it in ShortRead can make some sense.
>>
>> I remain confused about the need for putting any of this into
>> BiocGenerics at all.  It seems to me that properly construed
>> parallelization primitives ought to 'just work' with any object which
>> supports indexing and length.
>>
>> I would appreciate hearing arguments to the contrary.
>>
>> Florian, in a similar vein, could we not seek to change
>> parallel::makeCluster to be extensible to, say, support SGE cluster?
>> This seems like the 'right thing to do'.  ???
>>
>>
>> Regardless, I think we have raised some considerations that might inform
>> improvements to parallel, including points about error handling,
>> reducing results, block-level parallelization over List/Vector (in
>> addition to vector), etc.
>>
>> I think perhaps having a google doc that we can collectively edit to
>> contain the requirements we are trying to achieve might move us forward
>> effectively. Would this help? Or perhaps a page under
>> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>>
>>
>>>>>> Taking one step back, I recall some chatter last week (or two) about
>>>>>> some better ||-ization "primitives" -- something about a pvec
>>>>>> doo-dad, and there being ideas to wrap different types of ||-ization
>>>>>> behind an easy to use interface (I think this was the convo), and
>>>>>> then I took a further step back and often wonder why we just don't
>>>>>> bite the bullet and take advantage of the `foreach` infrastructure
>>>>>> that is already out there -- in which case, I could imagine a "doSGE"
>>>>>> package that might handle the particulars of what Florian is
>>>>>> referring to. You could then configure it externally via some
>>>>>> `registerDoSGE(some.config.object)` and just have the package code
>>>>>> happily run it through `foreach(...) %dopar%` and be done w/ it.
>>>>>>
>>>>>>
>>>>> IMHO it is relevant.  I have not looked for other abstractions, and
>>>>> this one seems to work.  Florian's objectives might be a good test
>>>>> case for adequacy.
>>>>>
>>>>
>>>> The registerDoDah does seem to be a useful abstraction.
>>
>> Is this not more-or-less the intention of parallel::setDefaultCluster?
>>
>> --Malcolm
>>
>>
>>
>>>>
>>>> I think there's a lot of work to do for some sort of coordinated
>>>> parallelization that putting parLapply into BiocGenerics might
>>>> encourage; not good things will happen when everyone in a call stack
>>>> tries to parallelize independently. But I'm in favor of parLapply in
>>>> BiocGenerics at least for the moment.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>>
>>>>>> ... at least, I thought this is what was being talked about here
>>>>>> (and popped up a week or two ago) -- sorry if I completely missed
>>>>>> the mark ...
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>>>>>> <florian.hahne at novartis.com> wrote:
>>>>>>
>>>>>>> Hi Martin,
>>>>>>> I could define the generics in my own package, but that would mean
>>>>>>> that those will only be available there, or in the global
>>>>>>> environment assuming that I also export them, or in all additional
>>>>>>> packages that explicitly import them from my name space. Now there
>>>>>>> already are a whole bunch of packages around that all allow for
>>>>>>> parallelization via a cluster object. Obviously those all import
>>>>>>> the parLapply function from the parallel package. That means that
>>>>>>> I can't simply supply my own modified cluster object, because the
>>>>>>> code that calls parLapply will not know about the generic in my
>>>>>>> package, even if it is attached. Ideally parLapply would be a
>>>>>>> generic function already in the parallel package. Not sure who
>>>>>>> needs to be convinced in order for this to happen, but my gut
>>>>>>> feeling was that it could be easier to have the generic in
>>>>>>> BiocGenerics.
>>>>>>> Maybe I am missing something obvious here, but imo there is no way
>>>>>>> to overwrite parLapply globally for my own class unless the generic
>>>>>>> is imported by everyone who wants to make use of the special method.
>>>>>>> Florian
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>>>>>>>
>>>>>>>   On 10/17/2012 05:45 AM, Hahne, Florian wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I was wondering whether it would be possible to have proper
>>>>>>>>> generics for some of the functions in the parallel package, e.g.
>>>>>>>>> parLapply and clusterCall. The reason I am asking is because I
>>>>>>>>> want to build an S4 class that essentially looks like an S3
>>>>>>>>> cluster object but knows how to deal with the SGE. That way I can
>>>>>>>>> abstract away all the overhead regarding job submission, job
>>>>>>>>> status and reducing the results in the parLapply method of that
>>>>>>>>> class, and would be able to supply this new cluster object to all
>>>>>>>>> of my existing functions that can be processed in parallel using
>>>>>>>>> a cluster object as input. I have played around with the
>>>>>>>>> BatchJobs package as an abstraction layer to SGE and that works
>>>>>>>>> nicely. As a test case I have created the necessary generics
>>>>>>>>> myself in order to supply my own SGEcluster object to a function
>>>>>>>>> that normally deals with the "regular" parallel package S3
>>>>>>>>> cluster objects and everything just worked out of the box, but
>>>>>>>>> obviously this fails once I am in a name space and my generic is
>>>>>>>>> not found anymore. Of course what we would really want is some
>>>>>>>>> proper abstraction of parallelization in R, but for now this
>>>>>>>>> seems to be at least a cheap compromise. Any thoughts on this?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't
>>>>>>>> actually send any email!
>>>>>>>>
>>>>>>>> Is there an obstacle to promoting these to generics in your own
>>>>>>>> package? The usual motivation for inclusion in BiocGenerics has
>>>>>>>> been to avoid conflicts between packages, but I'm not sure whether
>>>>>>>> this is the case (yet)? This would also add a dependency fairly
>>>>>>>> deep in the hierarchy.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>   Florian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>> Phone: (206) 667-2793
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>>    | Memorial Sloan-Kettering Cancer Center
>>>>>>    | Weill Medical College of Cornell University
>>>>>> Contact Info:
>>>>>> http://cbio.mskcc.org/~lianos/contact
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>-- 
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone:  (206) 667-5791
>Fax:    (206) 667-1319


