[Bioc-devel] parallel package generics

Fri Oct 26 00:21:29 CEST 2012

Herve,

Thanks so much.  I'm learning a fair bit about R/OO/S3/S4 in this fray.   I realized the interface/implementation error after hitting send, but your exposition was VERY HELPFUL to me in understanding how the generics get set up, when, and why.

You and Martin both have approaches that would allow mclapply and Reduce to work as written, allowing for
1)  removal of BiocGenerics::Reduce
2)  not adding to BiocGenerics::mclapply

This is all good.

However, it seems to me that both approaches are going to needlessly materialize a (GRanges)List as a base::list at potentially great cost in time.

Depending on what function you are Reducing/mcapplying, the difference can be HUGE.

I still think we need to fix/redefine the upstream to maintain the typical BioC amazing performance profile without cluttering up BiocGenerics.

--Malcolm

On 10/25/2012 11:53 AM, Cook, Malcolm wrote:
I might be willing with a little more ammo.

But, I need some guidance/education first….

Let's see if the following question helps me get it from you (you know who you are), or gets my head bitten off in this forum…

Question: Why do we have BiocGenerics at all?

Let's distinguish between 2 kinds of generics: (1) functions
in base R (i.e. packages base, stats, graphics, parallel, etc...)
that we want to turn into S4 generics, and (2) other generic
functions introduced in Bioconductor.

The main motivation for the BiocGenerics package was to have a
central place for generics of the 1st kind. The need for a central
place has to do with the need of a clear ownership of the *generic*,
and the notion of ownership is blurred by the fact that base functions
can be implicitly turned into generics by a simple attempt to attach
a method to them (with setMethod).

Now with more details. Any base function foo that is not already
a generic can implicitly be turned into a generic by any package
that contains a setMethod("foo", ...) statement. Even if those
packages are good citizens (by not trying to explicitly turn foo
into a generic with setGeneric(), which would have caused even
more problems), this approach is not satisfying. We still have
the need to explicitly turn foo into a generic in 1 place (with
a single setGeneric statement in the entire Bioconductor world),
for at least 2 reasons:

   (1) The implicit generic always dispatches on all its arguments.
       For many functions, this is not desirable (e.g. what's the
       point to have the Reduce generic dispatch on its 'right'
       argument?)

   (2) When a packageimplicitly turns foo into a generic, it needs
       to export that generic. It also needs to add an alias for the
       *generic* (i.e. \alias{foo}) somewhere in its man pages, in
       addition to the alias for the foo *method*. Otherwise it gets
       an 'R CMD check' warning (kind of fair, the foo generic being
       a new thing that needs to be documented somewhere). So with a
       model where foo is never turned explicitly into a generic by
       any package, we are in a situation where each package that
       contains a setMethod("foo", ...) statement needs to assume that
       this statement will trigger the creation of the implicit generic,
       and therefore needs to assume ownership of that generic (by
       exporting and documenting it). But what will really happen is
       that only one package will effectively get the ownership, and
       it will be the first package to be loaded! Not good.

For those 2 reasons we decided to use a central place (BiocGenerics)
to explicitly turn foo (and other base functions) into a generic.
Now the ownership of the foo generic is known in advance and any
developer that needs to define a method for that generic knows where
to look for that generic (i.e. s/he knows where to import the generic
from and where to find the man page for the generic). In addition,
now we can specify the arguments that are involved in the dispatch.

I notice for instance that the definition of Reduce that it provides is different in only one respect from base::Reduce.

Namely, base::Reduce coerces X to a list using as.list  if it is an object whereas BiocGenerics::Reduce does not.  Otherwise they are identical.

Just to clarify, BiocGenerics::Reduce does not provide an
implementation at all:

   > BiocGenerics::Reduce
   standardGeneric for "Reduce" defined from package "BiocGenerics"

   function (f, x, init, right = FALSE, accumulate = FALSE)
standardGeneric("Reduce")
   <environment: 0x20ee0a0>
   Methods may be defined for arguments: x
   Use  showMethods("Reduce")  for currently available ones.

It's just a generic function (i.e. the only thing it does it
dispatching). With only the BiocGenerics package loaded in your
session, there is only 1 method defined for that generic:

   > showMethods("Reduce")
   Function: Reduce (package BiocGenerics)
   f="ANY"

We call this method the "default method", because it's the one that
will be used if no other more specific method is available for the
object passed to it. And this default method is just base::Reduce:

   > selectMethod("Reduce", "ANY")
   Method Definition (Class "derivedDefaultMethod"):

   function (f, x, init, right = FALSE, accumulate = FALSE)
   {
   <SNIP>
   }
   <environment: namespace:base>

   Signatures:
           f
   target  "ANY"
   defined "ANY"

See the environment in which this method is defined? It's defined in
base, which is the proof that this method is really base::Reduce.

Note that, to add to the confusion, there is a bug in how showMethods()
displays the name of the argument used for dispatch: it's 'x', not 'f'.

The only code BiocGenerics contains with respect to Reduce is:

   setGeneric("Reduce", signature="x")

Pretty light isn't it? setGeneric() automatically sets base::Reduce
as the default method.

As a result, base::Reduce(myGRangesList) fails with "no method for coercing this S4 class to a vector".

Yes, because base::Reduce calls base::as.list internally and
base::as.list doesn't work on a GRangesList object. Note that if
you look at the implementation of base::Reduce, you won't see a
call to base::as.list, only a call to as.list. But base::as.list
is really what is being called, because that's the only as.list
function that exists from within the environment where base::Reduce
is defined, namely the namespace:base environment.

But, similarly as to my argument regarding pvec and mclapply, is not the "right thing to do" to FIX base::Reduce to NOT do this coercion instead of introducing a new generic?

I agree there is definitely room for improving some of the functions
defined in base that conceptually need only basic things like length(),
[, [[ to work on an object x. As long as those things are themselves
generics defined in base. Then I can implement methods for those basic
things, and have suddenly a lot of other things in base that work
out-of-the-box.

For Reduce and as.list though, it seems that *we* could do a better job.
This is because as.list is itself an S3 generic. My understanding
is that if we had an as.list.List method (in addition to the "as.list"
*S4* method for List objects), then base::as.list would work on any
List object (e.g. GRangesList object), and so would any base function
that uses as.list internally (like lapply or Reduce). I definitely
want to take the time to explore that approach, because my feeling is
that we could simplify things significantly by not turning lapply,
Reduce, and a bunch of other things, into S4 generics, and by
dropping a lot of methods (currently defined in IRanges) that we
shouldn't need to have.

If it didn't do this coersion then it would work with anything which implements `[[` and `seq` (including Vector).

Why introduce generics for things that are defined in terms of sequence access primitives?  Is there a good reason I am missing?

Conceptually, we should not need to do that, I agree. At least for
things in base that are defined in terms of sequence access primitives,
as long as those sequence access primitives are S3 generics. Otherwise
we need to make those things generic. Another reason for making
something a generic (even if it works out-of-the-box on any object)
is performance. People might want to implement a specific method for
their objects that improves on the out-of-the-box performance.

If you agree, then, there are (at least) two upstream packages to 'fix': (1) Functional and (2) parallel.

I'd say there is no need to fix the funprog functions (if that's what
you mean by Functional). I hope we can make Reduce and family just work
on any List object by adding as.list.List (S3 method) to IRanges.
However, you are right that it feels that we shoudln't even have to put
anything in IRanges for having as.list work on any S4 object for which
length() and [[ defined. Seems like maybe this could be achieved by
modifying as.list.default (defined in base):

Would be something like:

   as.list.default <- function (x, ...)
   {
     if (typeof(x) == "list")
         return(x)
     if (!is.object(x))
         return(.Internal(as.vector(x, "list")))
     lapply(seq_len(length(x)), function(i) x[[i]])
   }

Instead of:

   as.list.default <- function (x, ...)
   {
     if (typeof(x) == "list") x else .Internal(as.vector(x, "list"))
   }

As for parallel, yes, some of the functions in the "snow family" need
to be made generics. If not S4, at least S3 generics.

H.

Do you agree, or can you educate me otherwise?

--Malcolm

From: Tim Triche <tim.triche at gmail.com<mailto:tim.triche at gmail.com><mailto:tim.triche at gmail.com>>
Reply-To: "ttriche at usc.edu<mailto:ttriche at usc.edu><mailto:ttriche at usc.edu>" <ttriche at usc.edu<mailto:ttriche at usc.edu><mailto:ttriche at usc.edu>>
Date: Thu, 25 Oct 2012 13:21:46 -0500
To: Florian Hahne <florian.hahne at novartis.com<mailto:florian.hahne at novartis.com><mailto:florian.hahne at novartis.com>>
Cc: Hervé Pagès <hpages at fhcrc.org<mailto:hpages at fhcrc.org><mailto:hpages at fhcrc.org>>, Malcolm Cook <mec at stowers.org<mailto:mec at stowers.org><mailto:mec at stowers.org>>, "Michael com>" <lawrence.michael at gene.com<mailto:lawrence.michael at gene.com><mailto:lawrence.michael at gene.com>>, "bioc-devel at r-project.org<mailto:bioc-devel at r-project.org><mailto:bioc-devel at r-project.org>" <bioc-devel at r-project.org<mailto:bioc-devel at r-project.org><mailto:bioc-devel at r-project.org>>
Subject: Re: [Bioc-devel] parallel package generics

+1 although this time around I would prefer if someone else would stick their neck out :-)

On Thu, Oct 25, 2012 at 11:12 AM, Hahne, Florian <florian.hahne at novartis.com<mailto:florian.hahne at novartis.com><mailto:florian.hahne at novartis.com>> wrote:
For me the cleanest option with the least impact would be to have this
fixed directly in the parallel package. However I think that somebody with
more influence should suggest that to Rdevel.
If they will not do it, the other options seem all more or less equivalent
to me.
Florian
--

On 10/25/12 12:08 AM, "Hervé Pagès" <hpages at fhcrc.org<mailto:hpages at fhcrc.org><mailto:hpages at fhcrc.org>> wrote:

Hi,

With Florian use case, there seems to be a strong/immediate need for
dispatching on the cluster-like object passed as the 1st argument to
parLapply() and all the other functions in the parallel package that
belong to the "snow family" (14 functions in total, all documented in
?parallel::parLapply). So we've just added those 14 generics to
BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
mclapply(), mcmapply(), and pvec()) for now.

Note that the 14 new generics dispatch at least on their 1st argument
('cl'), but also on their 2nd argument when this argument is 'x', 'X'
or 'seq' (expected to be a vector-like or matrix-like object). This
opens the door to defining methods that take advantage of the of the
implementation of particular vector-like or matrix-like objects.

Also note that, even if some of the 14 functions in the "snow family"
are simple convenience wrappers to other functions in the family, we've
made all of them generics. For example clusterEvalQ() is a simple
wrapper to clusterCall():

    > clusterEvalQ
    function (cl = NULL, expr)
    clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
    <environment: namespace:parallel>

And it seems (at least intuitively) that implementing a "clusterCall"
method for my cluster-like objects should be enough to have
clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
this is not the case:

    setClass("FakeCluster", representation(nnodes="integer"))

    setMethod("clusterCall", "FakeCluster",
        function (cl=NULL, fun, ...) fun(...)
    )

Then:

    > mycluster <- new("FakeCluster", nnodes=10L)
    > clusterCall(mycluster, print, 1:6)
    [1] 1 2 3 4 5 6
    > clusterEvalQ(mycluster, print(1:6))
    Error in checkCluster(cl) : not a valid cluster

This is because the "clusterEvalQ" default method is calling
parallel::clusterCall() (which is *not* the generic), instead of
calling BiocGenerics::clusterCall() (which *is* the generic).

This would be avoided if clusterCall() was a generic defined in
the parallel package itself (or in a package that parallel depends
on). And this would of course be a better solution than having those
generics in BiocGenerics. Is someone willing to bring that case to
R-devel?

In the mean time I need to define a "clusterEvalQ" method:

    setMethod("clusterEvalQ", "FakeCluster",
        function (cl=NULL, expr)
            clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
    )

And then:

    > clusterEvalQ(mycluster, print(1:6))
    [1] 1 2 3 4 5 6

Finally note that this method I defined for my objects could be made the
default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
could put it in BiocGenerics. Or, since there is apparently nothing to
win by having clusterEvalQ() being a generic in the first place, we
could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
This function would be implemented *exactly* like
parallel::clusterEvalQ() (and it would mask it), except that now
it would call BiocGenerics::clusterCall() internally.

What should we do?

H.

On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com<mailto:lawrence.michael at gene.com><mailto:lawrence.michael at gene.com>>
wrote:

I agree that it would fruitful to have parLapply in BiocGenerics. It
looks
to be a flexible abstraction and its presence in the parallel package
makes
it ubiquitous. If it hasn't been done already, mclapply (and mcmapply)
would be good candidates, as well. The fork-based parallelism is
substantively different in terms of the API from the more general
parallelism of parLapply.

Someone was working on some more robust and convenient wrappers around
mclapply. Did that ever see the light of day?

If you are referring to

http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660

in which I had offered some small changes to parallel::pvec

       https://gist.github.com/3757873/

and after which Martin had provided me with a route I have not (yet?)
followed toward submitting a patch to R for consideration by R-devel /
Simon Urbanek in

http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-o
th
er-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c

On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <
mailinglist.honeypot at gmail.com<mailto:mailinglist.honeypot at gmail.com><mailto:mailinglist.honeypot at gmail.com>**> wrote:

    In response to a question from yesterday, I pointed someone to the
ShortRead `srapply` function and I wondered to myself why it had to
necessarily by "burried" in the ShortRead package (aside from it
having a `sr` prefix).

I don't know that srapply necessarily 'got it right'...

One thing I like about srapply is its support for a reduce argument.

I had thought it might be a good idea to move that (or something
like
that) to BiocGenerics (unless implementations aren't allowed there)
but also realized that it would add more dependencies where someone
might not necessarily need them.

But, almost surely, a large majority of the people will be happy to
do
some form of ||-ization, so in my mind it's not such an onerous
thing
to add -- on the other hand, this large majority is probably
enriched
for people who are doing NGS analysis, in which case, keeping it in
ShortRead can make some sense.

I remain confused about the need for putting any of this into
BiocGenerics
at all.  It seems to me that properly construed parallization primitives
ought to 'just work' with any object which supports indexing and length.

I would appreciate hearing arguments to the contrary.

Florian, in a similar vein, could we not seek to change
parallel::makeCluster to be extensible to, say, support SGE cluster?
THis
seems like the 'right thing to do'.  ???

Regardless, I think we have raised some considerations that might inform
improvements to parallel, including points about error handling,
reducing
results, block-level parallization over List/Vector (in addition to
vector), etc.

I think perhaps having a google doc that we can collectively edit to
contain the requirements we are trying to achieve might move us forward
effectively. Would this help? Or perhaps a page under
http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???

Taking one step back, I recall some chatter last week (or two) about
some better ||-ization "primitives" -- something about a pvec
doo-dad,
and there being ideas to wrap different types of ||-ization behind
an
easy to use interface (I think this was the convo), and then I took
a
further step back and often wonder why we just don't bite the bullet
and take advantage of the `foreach` infrastructure that is already
out
there -- in which case, I could imagne a "doSGE" package that might
handle the particulars of what Florain is referring to. You could
then
configure it externally via some
`registerDoSGE(some.config.**object)`
and just have the package code happily run it through `foreach(...)
%dopar%` and be done w/ it.

    IMHO it is relevant.  I have not looked for other abstractions,
and
this
one seems
to work.  Florian's objectives might be a good test case for
adequacy.

The registerDoDah does seem to be a useful abstraction.

Is this not more-or-less the intention of parallel::setDefaultCluster?

--Malcolm

I think there's a lot of work to do for some sort of coordinated
parallelization that putting parLapply into BiocGenerics might
encourage;
not good things will happen when everyone in a call stack tries to
parallelize independently. But I'm in favor of parLapply in
BiocGenerics at
least for the moment.

Martin

    ... at least, I thought this is what was being talked about here
(and
popped up a week or two ago) -- sorry if I completely missed the
mark
...

-steve

On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
<florian.hahne at novartis.com<mailto:florian.hahne at novartis.com><mailto:florian.hahne at novartis.com>> wrote:

Hi Martin,
I could define the generics in my own package, but that would mean
that
those will only be available there, or in the global environment
assuming
that I also export them, or in all additional packages that
explicitly
import them from my name space. Now there already are a whole bunch
of
packages around that all allow for parallelization via a cluster
object.
Obviously those all import the parLapply function from the parallel
package. That means that I can't simply supply my own modified
cluster
object, because the code that calls parLapply will not know about
the
generic in my package, even if it is attached. Ideally parLapply
would
be
a generic function already in the parallel package. Not sure who
needs
to
be convinced in order for this to happen, but my gut feeling was
that it
could be easier to have the generic in BiocGenerics.
Maybe I am missing something obvious here, but imo there is no way
to
overwrite parLapply globally for my own class unless the generic is
imported by everyone who wants to make use of the special method.
Florian
--

On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org<mailto:mtmorgan at fhcrc.org><mailto:mtmorgan at fhcrc.org>> wrote:

    On 10/17/2012 05:45 AM, Hahne, Florian wrote:

Hi all,
I was wondering whether it would be possible to have proper
generics

for

some of the functions in the parallel package, e.g. parLapply and
clusterCall. The reason I am asking is because I want to build an
S4
class
that essentially looks like an S3 cluster object but knows how to
deal
with the SGE. That way I can abstract away all the overhead
regarding
job
submission, job status and reducing the results in the parLapply
method
of
that class, and would be able to supply this new cluster object
to
all
of
my existing functions that can be processed in parallel using a
cluster
object as input. I have played around with the BatchJobs package
as an
abstraction layer to SGE and that work nicely. As a test case I
have
created the necessary generics myself in order to supply my own
SGEcluster
object to a function that normally deals with the "regular"
parallel
package S3 cluster objects and everything just worked out of the
box,
but
obviously this fails once I am in a name space and my generic is
not
found
anymore. Of course what we would really want is some proper
abstraction
of
parallelization in R, but for now this seem to be at least a
cheap
compromise. Any thoughts on this?

Hi Florian -- we talked about this locally, but I guess we didn't
actually send
any email!

Is there an obstacle to promoting these to generics in your own
package?
The
usual motivation for inclusion in BiocGenerics has been to avoid
conflicts
between packages, but I'm not sure whether this is the case (yet)?
This
would
also add a dependency fairly deep in the hierarchy.

What do you think?

Martin

    Florian

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

______________________________**_________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list

https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz
.c
h/mailman/listinfo/bioc-devel>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
     | Memorial Sloan-Kettering Cancer Center
     | Weill Medical College of Cornell University
Contact Info:

http://cbio.mskcc.org/~lianos/**contact<http://cbio.mskcc.org/%7Elian
os
/contact>

______________________________**_________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list

https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.
ch
/mailman/listinfo/bioc-devel>

           [[alternative HTML version deleted]]

______________________________**_________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list

https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.c
h/
mailman/listinfo/bioc-devel>

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793<tel:%28206%29%20667-2793>

______________________________**_________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list

https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch
/m
ailman/listinfo/bioc-devel>

          [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org<mailto:hpages at fhcrc.org><mailto:hpages at fhcrc.org>
Phone:  (206) 667-5791<tel:%28206%29%20667-5791>
Fax:    (206) 667-1319<tel:%28206%29%20667-1319>

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
A model is a lie that helps you see the truth.

Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org<mailto:hpages at fhcrc.org>
Phone:  (206) 667-5791
Fax:    (206) 667-1319