[Bioc-devel] Pushing towards a better home for matrix generics

Pages, Herve hp@ge@ @end|ng |rom |redhutch@org
Tue Jan 29 19:46:30 CET 2019


Yes the help system could enforce the full signature for the aliases but 
that means the end user then will have to always do 
?`colSums,SomeClass,ANY,ANY-method`, which feels unnecessary complicated 
and confusing in the case of a generic where dispatching on the 2nd and 
3rd arguments hardly makes sense.

Or are you saying that the help system should enforce an alias that 
strictly matches the signature explicitly used in the setMethod 
statement? Problem with this is that then there is no easy way for the 
end user to know a priori which form to use to access the man page. Is 
it ?`colSums,dgCMatrix,ANY,ANY-method` or is it 
?`colSums,dgCMatrix-method`. Right now when you type colSums<ENTERN> 
(after loading the Matrix package), you get this:

   > library(Matrix)
   > colSums
   standardGeneric for "colSums" defined from package "base"

   function (x, na.rm = FALSE, dims = 1, ...)
standardGeneric("colSums")
   <bytecode: 0x591c7d0>
   <environment: 0x591a408>
   Methods may be defined for arguments: x, na.rm, dims
   Use  showMethods("colSums")  for currently available ones.

This suggests that the correct form is ?`colSums,dgCMatrix,ANY,ANY-method`.

All this confusion can be avoided by specifying signature="x" in the 
definition of the implicit generic. It formalizes where dispatch really 
happens and sets expectations upfront. No loose ends.

Hope this makes sense,

H.

On 1/29/19 09:43, Martin Maechler wrote:
>>>>>> Pages, Herve
>>>>>>      on Tue, 29 Jan 2019 16:44:47 +0000 writes:
>      > Hi Martin.  Speed is not the concern: I just did some
>      > quick benchmarking and didn't observe any significant
>      > difference in method dispatch performance after doing
>      > setGeneric("toto", function(x, a=0, b=0, c=0)
>      > standardGeneric("toto")) vs doing setGeneric("toto",
>      > signature="x", function(x, a=0, b=0, c=0)
>      > standardGeneric("toto")).
>
>      > Here is the real concern to me:
>
>      > Aliases of the form
>      > \alias{colSums,dgCMatrix,ANY,ANY-method} are a real pain
>      > to maintain. It's also a pain for the end user to have to
>      > do ?`colSums,dgCMatrix,ANY,ANY-method` to access the man
>      > page for a particular method. I know Matrix uses "short"
>      > aliases i.e. aliases of the form
>      > \alias{colSums,dgCMatrix-method} so the user only needs to
>      > do ?`colSums,dgCMatrix-method` but there is a lot of
>      > fragility to the situation.
>
>      > Here is why: The exact form that needs to be used for
>      > these aliases can change anytime depending on what happens
>      > in one of the upstream packages (not a concern for the
>      > Matrix package because no upstream package defines colSums
>      > methods). More precisely: If all the colSums methods
>      > defined in the upstream packages and in my own package are
>      > defined with setMethod statements of the form:
>
>      >     setMethod("colSums", signature(x="SomeClass"), ...)
>
>      > then the aliases in the man pages must be of the form
>      > \alias{colSums,SomeClass-method} and the end user can just
>      > do ?`colSums,SomeClass-method`, which is good. But if
>      > **one** upstream package decides to use a setMethod
>      > statement of the form:
>
>      >    setMethod("colSums", signature(x="SomeClass",
>      > na.rm="ANY", dims="ANY"), ...)
>
>      > then the aliases for **all** the colSums methods in
>      > **all** the downstream packages now must be of the form
>      > \alias{colSums,SomeOtherClass,ANY,ANY-method}, even if the
>      > method for SomeOtherClass is defined with
>      > signature(x="SomeOtherClass").
>
> Hmm... but to me, the behavior you describe in the above paragraph
> seems rather an implementation "infelicity" in R's help / documentation system,
> than an intrinsic necessity.  Or have you thought more about
> this and discussed it with other S4 experts (John Chambers,
> Michael L., Martin Morgan, ...)  and came to a different conclusion?
>
> Very generally:
>
>    Just because the documentation (help system)
>    rules are implemented as they are should *NOT* influence "the
>    best way" to program things in R.
>
> and particularly for something such as S4 which has been adapted
> and tuned for a long time ...
>
> So, rather the documentation "setup" should adapt to what seems
> best from an R coding point of view.
>
> More specifically, if we are allowed to use short signatures in R code, i.e.,
>        signature(x=<someClass>)
> short for
>        signature(x=<someClass>, na.m="ANY", dims="ANY")
>
> then the documentation \alias{} should allow to use the same
> principle, as the documentation / help "keys" which \alias{.} constructs
> will be similarly uniquely determined
> (at least as long as other packages do not describe methods for
>   "my" <someClass>)
>
> So, the help / documentation (and "R CMD check" checks) should
> have been changed long ago, if you had sent patches to do so,
> n'est-ce pas?   :-) ;-)  [[yes, half jokingly]].
>
> Martin
>
>      > Also, as a consequence, now
>      > the end user has to use the long syntax to access the man
>      > pages for these methods. And if later the author of the
>      > upstream package decides to switch back to the
>      > setMethod("colSums", signature(x="SomeClass"), ...) form,
>      > then I have to switch back all the aliases in all my
>      > downstream packages to the short form again!
>
>      > This fragility of the alias syntax was one of the
>      > motivations for me to put many setGeneric statements of
>      > the form setGeneric("someGeneric", signature="x") in
>      > BiocGenerics over the years. So I don't have many dozens
>      > of aliases that suddenly break for mysterious reasons ('R
>      > CMD check' would suddenly starts reporting warnings for
>      > these aliases despite no changes in my package or in R).
>
>      > Best,
>
>      > H.
>
>      > On 1/29/19 03:16, Martin Maechler wrote:
>      >>>>>>> Michael Lawrence on Mon, 28 Jan 2019 20:47:58 -0800
>      >>>>>>> writes:
>      >> > That will have some consequences; for example, >
>      >> documentation aliases will need to change. Not sure how >
>      >> many packages will need to be fixed outside of Matrix,
>      >> but > it's not an isolated change. Martin might comment
>      >> on the > rationale for the full signature, since he
>      >> defined those > generics.
>      >>
>      >> > On Mon, Jan 28, 2019 at 7:21 PM Pages, Herve >
>      >> <hpages using fredhutch.org> wrote:
>      >> >>
>      >> >> Actually there is a 4th solution which is to modify
>      >> the >> definition of the implicit generics in the methods
>      >> >> package (file makeBasicFunsList.R) to make them
>      >> dispatch >> on their 1st arg only. Should be easy. Then
>      >> no package >> will need to use a setGeneric statement >>
>      >> anymore. Everybody will automatically get a clean >>
>      >> implicit generic. Including the Matrix package which >>
>      >> shouldn't need to be touched (except maybe for some >>
>      >> aliases in its man page that might need to be changed >>
>      >> from \alias{colSums,dgCMatrix,ANY,ANY-method} to >>
>      >> \alias{colSums,dgCMatrix-method}).
>      >> >>
>      >> >> Anybody wants to try to make a patch for this?
>      >>
>      >> >> H.
>      >>
>      >> I've already replied without having read the above two
>      >> messages.  In my reply I had indeed more or less argued
>      >> as Hervé does above.
>      >>
>      >> Michael, Hervé, .. : Why is it really so much better to
>      >> disallow dispatch for the other compulsory arguments?
>      >> Dispatch there allows to use methods for class "missing"
>      >> which is nicer in my eyes than the traditional default
>      >> argument + missing() "tricks".
>      >>
>      >> Is it mainly speed you are concerned about.  If yes, do
>      >> we have data (and data analysis) about performance here?
>      >>
>      >> Martin
>      >>
>      >> >>
>      >> >> On 1/28/19 19:00, Michael Lawrence wrote: > I agree
>      >> (2) >> is a good compromise. CC'ing Martin for his
>      >> perspective.
>      >> >> >
>      >> >> > Michael
>      >> >> >
>      >> >> > On Mon, Jan 28, 2019 at 6:58 PM Pages, Herve >>
>      >> <hpages using fredhutch.org> wrote: >> Hi Aaron,
>      >> >> >>
>      >> >> >> The 4 matrix summarization generics currently
>      >> defined >> in BiocGenerics >> are defined as followed:
>      >> >> >>
>      >> >> >> setGeneric("rowSums", signature="x") >> >>
>      >> setGeneric("colSums", signature="x") >> >>
>      >> setGeneric("rowMeans", signature="x") >> >>
>      >> setGeneric("colMeans", signature="x")
>      >> >> >>
>      >> >> >> The only reason for having these definitions in >>
>      >> BiocGenerics is to >> restrict dispatch the first >>
>      >> argument. This is cleaner than what we would >> get with
>      >> >> the implicit generics where dispatch is on all
>      >> arguments >> (it >> doesn't really make sense to dispatch
>      >> on toggles >> like 'na.rm' or >> 'dims'). Sticking to
>      >> simple dispatch >> when possible makes life easier for >>
>      >> the developer >> (especially in times of troubleshooting)
>      >> and for the user >> >> (methods are easier to discover
>      >> and their man pages >> easier to access).
>      >> >> >>
>      >> >> >> However, the 4 statements above create new generics
>      >> >> that mask the >> implicit generics defined in the
>      >> Matrix >> package (Matrix doesn't contain >> any
>      >> setGeneric >> statements for these generics, only
>      >> setMethod >> >> statements). This is a very unsatisfying
>      >> situation and it >> has hit me >> repeatedly over the
>      >> last couple of years.
>      >> >> >>
>      >> >> >> We have basically 3 ways to go. From simpler to
>      >> more >> complicated:
>      >> >> >>
>      >> >> >> 1) Give up on single dispatch for these
>      >> generics. That >> is, we remove the >> 4 statements above
>      >> from >> BiocGenerics. Then we use setMethod() in package
>      >> >> code >> like Matrix does.
>      >> >> >>
>      >> >> >> 2) Convince the Matrix folks to put the 4
>      >> statements >> above in Matrix.  >> Then any BioC package
>      >> that needs to >> define methods for these generics >>
>      >> would just need to >> import them from the Matrix
>      >> package. Maybe we could >> >> even push this one step
>      >> further by having BiocGenerics >> import and >> re-export
>      >> these generics. This would make >> them "available" in
>      >> BioC as >> soon as the BiocGenerics >> is loaded (and any
>      >> package that needs to define >> >> methods on them would
>      >> just need to import them from >> BiocGenerics).
>      >> >> >>
>      >> >> >> 3) Put the 4 statements above in a MatrixGenerics
>      >> >> package. Then convince >> the Matrix folks to define
>      >> >> methods on the generics defined in >> >>
>      >> MatrixGenerics. Very unlikely to happen!
>      >> >> >>
>      >> >> >> IMO 2) is the best compromise. Want to give it a
>      >> shot?
>      >> >> >>
>      >> >> >> H.
>      >> >> >>
>      >> >> >>
>      >> >> >> On 1/27/19 13:45, Aaron Lun wrote: >>> This is a >>
>      >> resurrection of some old threads:
>      >> >> >>>
>      >> >> >>>
>      >> >>
>      >> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_pipermail_bioc-2Ddevel_2017-2DNovember_012273.html&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=O21AQgvbUp3XRwM4jf0WeZA2ePj9yT3fc2X5hOsKNJk&s=pcpUyjpkQe6U79lZ_n2SANNp6Zj_s6i1Sq2yZx2NSjw&e=
>      >> >> >>>
>      >> >> >>>
>      >> >>
>      >> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_Bioconductor_MatrixGenerics_issues&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=O21AQgvbUp3XRwM4jf0WeZA2ePj9yT3fc2X5hOsKNJk&s=NrmcVnmvgkDp3p64J-izZU9VD5nFsFCWOTI-TsnzCpY&e=
>      >> >> >>>
>      >> >> >>> For those who are unfamiliar with this, the basic
>      >> >> issue is that various >>> Matrix and BiocGenerics >>
>      >> functions mask each other. This is mildly >>> frustrating
>      >> >> in interactive sessions:
>      >> >> >>>
>      >> >> >>>> library(Matrix) >>>> library(DelayedArray) >>>> x
>      >> <- >> rsparsematrix(10, 10, 0.1) >>>> colSums(x) # fails
>      >> >>>> >> Matrix::colSums(x) # okay >>> ... but quite
>      >> annoying >> during package development, requiring code
>      >> like >>> this:
>      >> >> >>>
>      >> >> >>> if (is(x, "Matrix")) { >>> z <- Matrix::colSums(x)
>      >> >> >>> } else { >>> z <- colSums(x) # assuming
>      >> DelayedArray >> does the masking.  >>> }
>      >> >> >>>
>      >> >> >>> ... which defeats the purpose of using S4 dispatch
>      >> in >> the first place.
>      >> >> >>>
>      >> >> >>> I have been encountering this issue with
>      >> increasing >> frequency in my >>> packages, as a lot of
>      >> my code base >> needs to be able to interface with >>>
>      >> both Matrix and >> Bioconductor objects (e.g.,
>      >> DelayedMatrices) at the >>> >> same time. What needs to
>      >> happen so that I can just write:
>      >> >> >>>
>      >> >> >>> z <- colSums(x)
>      >> >> >>>
>      >> >> >>> ... and everything will work for both Matrix and
>      >> >> Bioconductor classes?  >>> It seems that many of these
>      >> >> function names are implicit generics >>> anyway, can
>      >> >> BiocGenerics take advantage of that for the time
>      >> being?
>      >> >> >>>
>      >> >> >>> Best,
>      >> >> >>>
>      >> >> >>> Aaron
>      >> >> >>>
>      >> >> >>> _______________________________________________
>      >> >>> >> Bioc-devel using r-project.org mailing list >>> >>
>      >> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=O21AQgvbUp3XRwM4jf0WeZA2ePj9yT3fc2X5hOsKNJk&s=JtgGBnaZJH44fV8OUp-SwnHxhD_i_mdVkqoMfUoA5tM&e=
>      >> >> >> --
>      >> >> >> Hervé Pagès
>      >> >> >>
>      >> >> >> Program in Computational Biology >> Division of
>      >> Public >> Health Sciences >> Fred Hutchinson Cancer
>      >> Research Center >> >> 1100 Fairview Ave. N, M1-B514 >>
>      >> P.O. Box 19024 >> >> Seattle, WA 98109-1024
>      >> >> >>
>      >> >> >> E-mail: hpages using fredhutch.org >> Phone: (206)
>      >> 667-5791 >> >> Fax: (206) 667-1319
>      >> >> >>
>      >> >> >> _______________________________________________ >>
>      >> >> Bioc-devel using r-project.org mailing list >> >>
>      >> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=c-Mmi30ouubEEHC5W9_X6DIwxblt1nQlIfgCaK8uCJU&s=U8Hu1kzglD_RP7t_eR5w_nYAIaupBgrEKx11geSZwVg&e=
>      >> >>
>      >> >> --
>      >> >> Hervé Pagès
>      >> >>
>      >> >> Program in Computational Biology Division of Public >>
>      >> Health Sciences Fred Hutchinson Cancer Research Center >>
>      >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>      >> >> 98109-1024
>      >> >>
>      >> >> E-mail: hpages using fredhutch.org Phone: (206) 667-5791
>      >> Fax: >> (206) 667-1319
>      >> >>
>
>      > --
>      > Hervé Pagès
>
>      > Program in Computational Biology Division of Public Health
>      > Sciences Fred Hutchinson Cancer Research Center 1100
>      > Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>      > 98109-1024
>
>      > E-mail: hpages using fredhutch.org Phone: (206) 667-5791 Fax:
>      > (206) 667-1319
>
-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list