[Bioc-devel] parallel package generics
Hervé Pagès
hpages at fhcrc.org
Thu Oct 25 20:41:57 CEST 2012
On 10/25/2012 10:36 AM, Vincent Carey wrote:
> if R-core (who afaics maintain parallel) are unwilling to adopt/maintain
> these suggestions, why not write a biocParallel and/or cookParallel
> package that does? it seems to me that any interested party can pose
> the issue to r-devel. if no answer is given, we can all learn from the
> experimental alternate package.
This is certainly worth thinking about. IMO it helps to make the analogy
with the DBI world, where RMySQL, RSQLite, RPostgreSQL, etc. are
plugins that implement DBI-compliant specific back-ends. With this
analogy, BiocParallel (or whatever we call it, cookParallel?) would be
the analog of DBI but for cluster back-ends. It could provide
built-in support for SNOW clusters but would also make it easy for
people to write a BiocParallel-compliant package that implements a
specific back-end.
That being said, it feels that the parallel package should be that
BiocParallel package. That is, it should provide the clean parallel
abstraction layer that we are aiming for and provide built-in
support for SNOW clusters (which it currently does). And the only
thing R-core would need to do to make this happen is turn some of
their functions into generics.
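
To make the analogy concrete, here is a minimal sketch of the plug-in idea.
All the names in it (bpClusterCall, bpClusterEvalQ, SerialBackend) are made
up for illustration and are not part of parallel or BiocGenerics; it only
assumes the methods package:

```r
library(methods)

# Hypothetical generic, standing in for what clusterCall() could look like
# if it were a generic in the abstraction-layer package. Dispatches on 'cl'.
setGeneric("bpClusterCall", signature="cl",
    function(cl, fun, ...) standardGeneric("bpClusterCall"))

# Hypothetical back-end class, as a plug-in package might define it.
setClass("SerialBackend", representation(nnodes="integer"))

# The back-end's method: call 'fun' once per (pretend) node, sequentially.
setMethod("bpClusterCall", "SerialBackend",
    function(cl, fun, ...)
        lapply(seq_len(cl@nnodes), function(i) fun(...))
)

# A convenience wrapper written against the generic, so it works with any
# back-end that provides a "bpClusterCall" method.
bpClusterEvalQ <- function(cl, expr)
    bpClusterCall(cl, eval, substitute(expr), envir=.GlobalEnv)

backend <- new("SerialBackend", nnodes=2L)
res <- bpClusterEvalQ(backend, 1 + 1)
```

Because the wrapper calls the generic, a back-end package in this sketch only
has to supply the one method to get the wrapper for free.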
H.
>
> On Thu, Oct 25, 2012 at 12:44 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>
>
>
> On 10/24/12 5:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>
> >Hi,
> >
> >With Florian's use case, there seems to be a strong/immediate need for
> >dispatching on the cluster-like object passed as the 1st argument to
> >parLapply() and all the other functions in the parallel package that
> >belong to the "snow family" (14 functions in total, all documented in
> >?parallel::parLapply). So we've just added those 14 generics to
> >BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
> >mclapply(), mcmapply(), and pvec()) for now.
> >
> >Note that the 14 new generics dispatch at least on their 1st argument
> >('cl'), but also on their 2nd argument when this argument is 'x', 'X'
> >or 'seq' (expected to be a vector-like or matrix-like object). This
> >opens the door to defining methods that take advantage of the
> >implementation of particular vector-like or matrix-like objects.
> >
> >Also note that, even if some of the 14 functions in the "snow family"
> >are simple convenience wrappers to other functions in the family, we've
> >made all of them generics. For example clusterEvalQ() is a simple
> >wrapper to clusterCall():
> >
> > > clusterEvalQ
> > function (cl = NULL, expr)
> > clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
> > <environment: namespace:parallel>
> >
> >And it seems (at least intuitively) that implementing a "clusterCall"
> >method for my cluster-like objects should be enough to have
> >clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
> >this is not the case:
> >
> > setClass("FakeCluster", representation(nnodes="integer"))
> >
> > setMethod("clusterCall", "FakeCluster",
> > function (cl=NULL, fun, ...) fun(...)
> > )
> >
> >Then:
> >
> > > mycluster <- new("FakeCluster", nnodes=10L)
> > > clusterCall(mycluster, print, 1:6)
> > [1] 1 2 3 4 5 6
> > > clusterEvalQ(mycluster, print(1:6))
> > Error in checkCluster(cl) : not a valid cluster
> >
> >This is because the "clusterEvalQ" default method is calling
> >parallel::clusterCall() (which is *not* the generic), instead of
> >calling BiocGenerics::clusterCall() (which *is* the generic).
> >
> >This would be avoided if clusterCall() was a generic defined in
> >the parallel package itself (or in a package that parallel depends
> >on). And this would of course be a better solution than having those
> >generics in BiocGenerics. Is someone willing to bring that case to
> >R-devel?
> >
> >In the mean time I need to define a "clusterEvalQ" method:
> >
> > setMethod("clusterEvalQ", "FakeCluster",
> > function (cl=NULL, expr)
> > clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
> > )
> >
> >And then:
> >
> > > clusterEvalQ(mycluster, print(1:6))
> > [1] 1 2 3 4 5 6
> >
> >Finally note that this method I defined for my objects could be made the
> >default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
> >could put it in BiocGenerics. Or, since there is apparently nothing to
> >win by having clusterEvalQ() being a generic in the first place, we
> >could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
> >This function would be implemented *exactly* like
> >parallel::clusterEvalQ() (and it would mask it), except that now
> >it would call BiocGenerics::clusterCall() internally.
> >
> >What should we do?
>
> We have the identical problem already when we try to use parallel mcmapply
> on a BioC List (i.e. GRangesList).
>
> Witness:
>
> The casual user (ehrm, myself at least) expects that since I can 'lapply'
> on a BioC GRangesList (or any other List) that I should be able to
> mclapply on it.
>
> Sadly the casual user is wrong, and gets an error.
>
> Why?
>
> Because parallel::mclapply(X... calls as.list on X.
>
> Which yields 'Error in as.list.default : no method for coercing this S4
> class to a vector'
>
> But, you say, IRanges defines as.list for Lists, as can be demonstrated by
> calling as.list(myGRL) on a GRangesList.
>
> Here I yield the floor to someone who can explain why this is so, for I
> have not studied enough how namespaces/packages/symboltables/whatever work
> in R.
>
> Anyone?
>
> Regardless, one BAD workaround I found works is to snarf (tm) the source
> for mclapply, evaluate it in the global namespace, after prefixing all
> parallel internal functions with 'parallel:::'.
>
> After doing this, the modified mclapply works as one might expect.
>
> So, there is at least an issue regarding how method dispatch works across
> namespaces. Again I yield the floor, but expect that it can be fixed.
>
> BUT, FURTHERMORE, MCLAPPLY SHOULD NOT COERCE X TO LIST ANYWAY
>
> Why? Because calling `as.list` incurs the overhead of (needlessly!?!)
> coercing this nice tight GRangesList into a base::list.
>
> There is NO REASON for it to be coercing X to a list at all. By my
> lights, mclapply only needs `length` and `seq_along` defined on X, which
> ARE ALREADY available to a GRangesList from Vector. Indeed, comment
> out the X<-as.list(X) coercion in mclapply and, lo, it still works on a
> GRangesList as hoped, and on a 1000-element GRangesList takes ~18x less
> user time to mclapply(myGRL,length) (and it is even shorter just to use
> elementLengths, but that is not the point).
>
> In this case the solution appears to be to FIX the upstream package so
> that method dispatch works correctly (I expect that length and seq_along
> are only visible to my snarfed mclapply and would suffer from similar
> error without addressing the package issue).
>
> Indeed, similarly, in my proposed changes to parallel::pvec, I found a
> simple change that made it work with Vector as well as vector, since
> Vector implements `[` and `length`.
>
> I still think the solution to getting an SGE (et al.) parallel back-end
> is to seek to improve the upstream package to make it 'pluggable' for
> different parallel backends.
>
> I don't think I'm the right person to represent this to R-devel as
> obviously I am not schooled (yet!?!?) in the workings of
> S3/S4/signatures/methods/etc.
>
> Herve, I have a hunch that your 'In the mean time' solution is a
> workaround that has the potential to invite further confusion.
>
> Anyone, as a perhaps related issue, and as an opportunity to educate me,
> can you explain why untrace does NOT completely work on `lapply` (with
> BiocGenerics loaded). Viz:
>
> trace(lapply)
> untrace(lapply)
> IRanges(1,2)
> IRanges of length 1
> trace: lapply(dots, methods:::.class1)
> ....
>
>
> --Malcolm
>
>
>
>
>
>
> >
> >H.
> >
> >
> >On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
> >> On 10/24/12 12:44 AM, "Michael Lawrence"
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
> >>wrote:
> >>
> >>> I agree that it would be fruitful to have parLapply in BiocGenerics. It
> >>> looks to be a flexible abstraction and its presence in the parallel package
> >>> makes it ubiquitous. If it hasn't been done already, mclapply (and mcmapply)
> >>> would be good candidates, as well. The fork-based parallelism is
> >>> substantively different in terms of the API from the more general
> >>> parallelism of parLapply.
> >>>
> >>> Someone was working on some more robust and convenient wrappers around
> >>> mclapply. Did that ever see the light of day?
> >>
> >>
> >> If you are referring to
> >>
> >>http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
> >>
> >> in which I had offered some small changes to parallel::pvec
> >>
> >> https://gist.github.com/3757873/
> >>
> >> and after which Martin had provided me with a route I have not (yet?)
> >> followed toward submitting a patch to R for consideration by R-devel /
> >> Simon Urbanek in
> >>
> >>
> >>http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
> >>
> >>
> >>
> >>
> >>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou
> >>>>> <mailinglist.honeypot at gmail.com> wrote:
> >>>>>
> >>>>>> In response to a question from yesterday, I pointed someone to the
> >>>>>> ShortRead `srapply` function and I wondered to myself why it had to
> >>>>>> necessarily be "buried" in the ShortRead package (aside from it
> >>>>>> having a `sr` prefix).
> >>>>>>
> >>>>>
> >>>> I don't know that srapply necessarily 'got it right'...
> >>
> >>
> >> One thing I like about srapply is its support for a reduce argument.
> >>
> >>>>>> I had thought it might be a good idea to move that (or something like
> >>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
> >>>>>> but also realized that it would add more dependencies where someone
> >>>>>> might not necessarily need them.
> >>
> >>
> >>>>>>
> >>>>>> But, almost surely, a large majority of the people will be happy to do
> >>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
> >>>>>> to add -- on the other hand, this large majority is probably enriched
> >>>>>> for people who are doing NGS analysis, in which case, keeping it in
> >>>>>> ShortRead can make some sense.
> >>
> >> I remain confused about the need for putting any of this into BiocGenerics
> >> at all. It seems to me that properly construed parallelization primitives
> >> ought to 'just work' with any object which supports indexing and length.
> >>
> >> I would appreciate hearing arguments to the contrary.
> >>
> >> Florian, in a similar vein, could we not seek to change
> >> parallel::makeCluster to be extensible to, say, support SGE cluster?
> >> This seems like the 'right thing to do'. ???
> >>
> >>
> >> Regardless, I think we have raised some considerations that might inform
> >> improvements to parallel, including points about error handling, reducing
> >> results, block-level parallelization over List/Vector (in addition to
> >> vector), etc.
> >>
> >> I think perhaps having a google doc that we can collectively edit to
> >> contain the requirements we are trying to achieve might move us forward
> >> effectively. Would this help? Or perhaps a page under
> >> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
> >>
> >>
> >>>>>> Taking one step back, I recall some chatter last week (or two) about
> >>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
> >>>>>> and there being ideas to wrap different types of ||-ization behind an
> >>>>>> easy to use interface (I think this was the convo), and then I took a
> >>>>>> further step back and often wonder why we just don't bite the bullet
> >>>>>> and take advantage of the `foreach` infrastructure that is already out
> >>>>>> there -- in which case, I could imagine a "doSGE" package that might
> >>>>>> handle the particulars of what Florian is referring to. You could then
> >>>>>> configure it externally via some `registerDoSGE(some.config.object)`
> >>>>>> and just have the package code happily run it through `foreach(...)
> >>>>>> %dopar%` and be done w/ it.
> >>>>>>
> >>>>>>
> >>>>> IMHO it is relevant. I have not looked for other abstractions, and this
> >>>>> one seems to work. Florian's objectives might be a good test case for
> >>>>> adequacy.
> >>>>>
> >>>>
> >>>> The registerDoDah does seem to be a useful abstraction.
> >>
> >> Is this not more-or-less the intention of parallel::setDefaultCluster?
> >>
> >> --Malcolm
> >>
> >>
> >>
> >>>>
> >>>> I think there's a lot of work to do for some sort of coordinated
> >>>> parallelization that putting parLapply into BiocGenerics might
> >>>> encourage;
> >>>> not good things will happen when everyone in a call stack tries to
> >>>> parallelize independently. But I'm in favor of parLapply in
> >>>> BiocGenerics at
> >>>> least for the moment.
> >>>>
> >>>> Martin
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>>> ... at least, I thought this is what was being talked about here (and
> >>>>>> popped up a week or two ago) -- sorry if I completely missed the mark
> >>>>>> ...
> >>>>>>
> >>>>>> -steve
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
> >>>>>> <florian.hahne at novartis.com
> <mailto:florian.hahne at novartis.com>> wrote:
> >>>>>>
> >>>>>>> Hi Martin,
> >>>>>>> I could define the generics in my own package, but that would mean that
> >>>>>>> those will only be available there, or in the global environment assuming
> >>>>>>> that I also export them, or in all additional packages that explicitly
> >>>>>>> import them from my name space. Now there already are a whole bunch of
> >>>>>>> packages around that all allow for parallelization via a cluster object.
> >>>>>>> Obviously those all import the parLapply function from the parallel
> >>>>>>> package. That means that I can't simply supply my own modified cluster
> >>>>>>> object, because the code that calls parLapply will not know about the
> >>>>>>> generic in my package, even if it is attached. Ideally parLapply would be
> >>>>>>> a generic function already in the parallel package. Not sure who needs to
> >>>>>>> be convinced in order for this to happen, but my gut feeling was that it
> >>>>>>> could be easier to have the generic in BiocGenerics.
> >>>>>>> Maybe I am missing something obvious here, but imo there is no way to
> >>>>>>> overwrite parLapply globally for my own class unless the generic is
> >>>>>>> imported by everyone who wants to make use of the special method.
> >>>>>>> Florian
> >>>>>>> --
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
> >>>>>>>
> >>>>>>> On 10/17/2012 05:45 AM, Hahne, Florian wrote:
> >>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>> I was wondering whether it would be possible to have proper generics for
> >>>>>>>>> some of the functions in the parallel package, e.g. parLapply and
> >>>>>>>>> clusterCall. The reason I am asking is because I want to build an S4 class
> >>>>>>>>> that essentially looks like an S3 cluster object but knows how to deal
> >>>>>>>>> with the SGE. That way I can abstract away all the overhead regarding job
> >>>>>>>>> submission, job status and reducing the results in the parLapply method of
> >>>>>>>>> that class, and would be able to supply this new cluster object to all of
> >>>>>>>>> my existing functions that can be processed in parallel using a cluster
> >>>>>>>>> object as input. I have played around with the BatchJobs package as an
> >>>>>>>>> abstraction layer to SGE and that works nicely. As a test case I have
> >>>>>>>>> created the necessary generics myself in order to supply my own SGEcluster
> >>>>>>>>> object to a function that normally deals with the "regular" parallel
> >>>>>>>>> package S3 cluster objects and everything just worked out of the box, but
> >>>>>>>>> obviously this fails once I am in a name space and my generic is not found
> >>>>>>>>> anymore. Of course what we would really want is some proper abstraction of
> >>>>>>>>> parallelization in R, but for now this seems to be at least a cheap
> >>>>>>>>> compromise. Any thoughts on this?
> >>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't
> >>>>>>>> actually send any email!
> >>>>>>>>
> >>>>>>>> Is there an obstacle to promoting these to generics in your own package?
> >>>>>>>> The usual motivation for inclusion in BiocGenerics has been to avoid
> >>>>>>>> conflicts between packages, but I'm not sure whether this is the case (yet)?
> >>>>>>>> This would also add a dependency fairly deep in the hierarchy.
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Martin
> >>>>>>>>
> >>>>>>>> Florian
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>>>>>>> 1100 Fairview Ave. N.
> >>>>>>>> PO Box 19024 Seattle, WA 98109
> >>>>>>>>
> >>>>>>>> Location: Arnold Building M1 B861
> >>>>>>>> Phone: (206) 667-2793
> >>>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Bioc-devel at r-project.org mailing list
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Steve Lianoglou
> >>>>>> Graduate Student: Computational Systems Biology
> >>>>>> | Memorial Sloan-Kettering Cancer Center
> >>>>>> | Weill Medical College of Cornell University
> >>>>>> Contact Info:
> >>>>>>
> >>>>>> http://cbio.mskcc.org/~lianos/contact
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>>> 1100 Fairview Ave. N.
> >>>> PO Box 19024 Seattle, WA 98109
> >>>>
> >>>> Location: Arnold Building M1 B861
> >>>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >--
> >Hervé Pagès
> >
> >Program in Computational Biology
> >Division of Public Health Sciences
> >Fred Hutchinson Cancer Research Center
> >1100 Fairview Ave. N, M1-B514
> >P.O. Box 19024
> >Seattle, WA 98109-1024
> >
> >E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> >Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> >Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319