[Bioc-devel] Compatibility of Bioconductor with tidyverse S3 classes/methods

Vincent Carey @tvjc @end|ng |rom ch@nn|ng@h@rv@rd@edu
Sat Feb 8 15:26:12 CET 2020


On Fri, Feb 7, 2020 at 6:39 PM stefano <mangiolastefano using gmail.com> wrote:

> Thanks Guys for the discussion (I am learning a lot),
>
> *To Martin:*
>
> Thanks for the tips. I will start to implement those S4 style methods
> https://github.com/stemangiola/ttBulk/issues/7
>
> I would *really *like to be part of Bioconductor community with this
> package, if just this
>
> > " One would expect the vignette and examples to primarily emphasize the
> use of the interoperable (SummmarizedExperiment) version. "
>
> Could become this
>
> > One would expect the vignette and examples to emphasize the use of the
> interoperable (SummmarizedExperiment) version.
>
> I agree with the integration priority of Bioconductor, but this repository
> (and this philosophy) is more than its data structures. There should be
> space for more than one approach to do things, given that the principle are
> respected.
>
> If this is true, I could really spend energies to use methods as you
> suggested and implement the SummarisedExperiment stream. And with the tips
> of the community the link will become stronger and stronger with time and
> versions.
>
>
> *To Vincent*
>
> Thanks a lot for the interest.
>
> *> One thing I feel is missing is an approach to the following question:
> [..] How do I make one that works the way ttBulk's operators work?*
>
> I'm afraid I don't really understand the question. Are you wondering about
> extension of the framework? Or creating a similar framework for other
> applications? Could you please reformulate, maybe giving a concrete
> example?
>

We can take further discussion to the issues on the github repo but I will
briefly respond here.  Consider reduce_dimensions.
You give a small number of method options here -- PCA, MDS, tSNE.  The MDS
option makes its way to stats::cmdscale via limma::plotMDS;
the PCA option uses prcomp.  For any number of reasons, users may want to
select alternate dimension reduction procedures or
tune them in ways not passed up through your interface.  This might involve
modifications to your code to introduce changes, or
one could imagine a protocol for "dropping in" a new operator for ttBulk
pipelines.  My question is to understand how this level
of flexibility might be achieved.

An example of an R package that pursues this is mlr3, see
https://github.com/mlr-org/mlr3learners.template ... a link there is broken
but the full details of contributing new pipeline elements are at
https://mlr3book.mlr-org.com/pipelines.html


> *> Are there patterns there that are preserved across different operators?
> *
>
> A commonality is the use of code for integrating the new calculated
> information (dplyr), validation functions, ..
>
> *> Can they be factored out to improve maintainability?*
>
> Almost surely yes, this is the first version, I hope to see enough
> interest, improve the API upon feedback, and hope for (intellectual and
> practical) contributions from more experts in software engineering.
>
> *> validObject *
>
> Seems a good method, and as far as I tested works for S3 objects as well.
> I will try to implement it. In fact I already added it as issue into Github
> https://github.com/stemangiola/ttBulk/issues/6
>
> At the moment I have a custom validation function
>
> Best wishes.
>
> *Stefano *
>
>
>
> Stefano Mangiola | Postdoctoral fellow
>
> Papenfuss Laboratory
>
> The Walter Eliza Hall Institute of Medical Research
>
> +61 (0)466452544
>
>
> Il giorno sab 8 feb 2020 alle ore 01:54 Vincent Carey <
> stvjc using channing.harvard.edu> ha scritto:
>
>> This is an interesting discussion and I hope it is ok to continue it a
>> bit.  I found the
>> readme for the ttBulk repo extremely enticing and I am sure many people
>> will want to
>> explore this way of working with genomic data.  I have only a few moments
>> to explore
>> it and did not read the vignette, but it looks to me as if it is mostly
>> recapitulated in the
>> README, which is an excellent overview.
>>
>> One thing I feel is missing is an approach to the following question: I
>> like the
>> idea of a pipe-oriented operator for programming steps in genomic
>> workflows.
>> How do I make one that works the way ttBulk's operators work?  Well, I can
>> have a look at ttBulk:::reduce_dimensions.ttBulk ...
>>
>> It's involved.  Are there patterns there that
>> are preserved across different operators?  Can
>> they be factored out to improve maintainability?
>>
>> One other point before I run
>>
>> It seems to me the operators "require" that certain
>> fields be defined in their tibble operands.
>>
>> > names(attributes(counts))
>>
>> [1] "names"      "class"      "row.names"  "parameters"
>>
>> > attributes(counts)$names
>>
>> [1] "sample"             "transcript"         "Cell type"
>>
>> [4] "count"              "time"               "condition"
>>
>> [7] "batch"              "factor_of_interest"
>>
>> > validObject(counts)
>>
>> *Error in .classEnv(classDef) : *
>>
>> *  trying to get slot "package" from an object of a basic class ("NULL")
>> with no slots*
>>
>>
>> Enter a frame number, or 0 to exit
>>
>>
>> 1: validObject(counts)
>>
>> 2: .classEnv(classDef)
>>
>>
>> I think you mentioned validity checking in a previous email.  This
>>
>> is a feature of S4 that is not frequently invoked.  Of course
>>
>> validObject will not work on counts, but do you have something similar?
>>
>> (Not all working S4 objects from Bioc will pass validObject tests, but
>>
>> they should....)
>>
>>
>>
>> On Fri, Feb 7, 2020 at 5:26 AM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>
>>> yes, absolutely. A common pattern might be to implement a generic
>>>
>>>     setGeneric("foo", function(x, ...) standardGeneric("foo"))
>>>
>>> an ‘internal’ function that implements the method on base R data types
>>>
>>>     .foo <- function(x) {
>>>         stopifnot("'x' must be a matrix" = is.matrix(x))
>>>         t(x)
>>>     }
>>>
>>> and methods that act as a facade to the implementation
>>>
>>>     setMethod("foo", "tbl_df", function(x) {
>>>         x <- as.matrix(x)
>>>         result <- .foo(x)
>>>         as_tibble(result)
>>>     })
>>>
>>>     setMethod("foo", "SummarizedExperiment", function(x) {
>>>         result <- .foo(assay(x))
>>>         assays(x)[["foo"]] <- result
>>>         x
>>>     })
>>>
>>> One would expect the vignette and examples to primarily emphasize the
>>> use of the interoperable (SummmarizedExperiment) version.
>>>
>>> Martin Morgan
>>>
>>> From: stefano <mangiolastefano using gmail.com>
>>> Date: Friday, February 7, 2020 at 12:31 AM
>>> To: Michael Lawrence <lawrence.michael using gene.com>
>>> Cc: Martin Morgan <mtmorgan.bioc using gmail.com>, "bioc-devel using r-project.org"
>>> <bioc-devel using r-project.org>
>>> Subject: Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse
>>> S3 classes/methods
>>>
>>> Would this scenario satisfy " make the package _directly_ compatible
>>> with standard Bioconductor data structures"
>>>
>>> If an input is SummarizedExperiment return SummarizedExperiment, if the
>>> input is a tbl_df or ttBulk, return ttBulk (?)
>>>
>>>
>>> Best wishes.
>>> Stefano
>>>
>>> Stefano Mangiola | Postdoctoral fellow
>>> Papenfuss Laboratory
>>> The Walter Eliza Hall Institute of Medical Research
>>> +61 (0)466452544
>>>
>>>
>>> Il giorno ven 7 feb 2020 alle ore 16:15 Michael Lawrence <mailto:
>>> lawrence.michael using gene.com> ha scritto:
>>> I would urge you to make the package _directly_ compatible with
>>> standard Bioconductor data structures; no explicit conversion. But you
>>> can create wrapper methods (even on an S3 generic) that perform the
>>> conversion automatically. You'll probably want two separate APIs
>>> though (in different styles), for one thing automatic conversion is
>>> obviously not possible for return values.
>>>
>>> Michael
>>>
>>> On Thu, Feb 6, 2020 at 5:34 PM stefano <mailto:mangiolastefano using gmail.com>
>>> wrote:
>>> >
>>> > Thanks Michael,
>>> >
>>> > yes in a sense, ttBulk and SummariseExperiment can be considere as two
>>> interfaces. Would be fair enough to create a function that convert from one
>>> to the other, although the default would be ttBulk?
>>> >
>>> > > I'm not sure the tidyverse is a great answer to the user interface,
>>> because it lacks domain semantics
>>> >
>>> > Would be fair to say that ttBulk class could be considered a tibble
>>> with specific semantics? In the sense that it holds information about key
>>> column names (.sample, .transcript, .abundance, .normalised_abundance,
>>> etc..), and has a validator (that is triggered at every ttBulk function).
>>> >
>>> > I think at the moment, given (i) S3 problem, and (ii) the lack of
>>> formal foundation on SummaisedExperiment interface (that maybe would
>>> require an S4 technology itself, where SummariseExperiment could be a
>>> slot?) my package would belong more to CRAN, until those two issues will
>>> have been resolved.
>>> >
>>> > I imagine there are not many cases where a CRAN package migrated to
>>> Bioconductor after complying with the ecosystem policies.
>>> >
>>> > Thanks a lot.
>>> >
>>> > Best wishes.
>>> >
>>> > Stefano
>>> >
>>> >
>>> >
>>> > Stefano Mangiola | Postdoctoral fellow
>>> >
>>> > Papenfuss Laboratory
>>> >
>>> > The Walter Eliza Hall Institute of Medical Research
>>> >
>>> > +61 (0)466452544
>>> >
>>> >
>>> >
>>> > Il giorno ven 7 feb 2020 alle ore 12:12 Michael Lawrence <mailto:
>>> lawrence.michael using gene.com> ha scritto:
>>> >>
>>> >> There's a difference between implementing software, where one wants
>>> >> formal data structures, and providing a convenient user interface.
>>> >> Software needs to interface with other software, so a package could
>>> >> provide both types of interfaces, one based on rich (S4) data
>>> >> structures, another on simpler structures with an API more amenable to
>>> >> analysis. I'm not sure the tidyverse is a great answer to the user
>>> >> interface, because it lacks domain semantics. This is still an active
>>> >> area of research (see Stuart Lee's plyranges, for example). I hope you
>>> >> can find a reasonable compromise that enables you to integrate ttBulk
>>> >> into Bioconductor, so that it can take advantage of the synergies the
>>> >> ecosystem provides.
>>> >>
>>> >> PS: There is no simple fix for your example.
>>> >>
>>> >> Michael
>>> >>
>>> >> On Thu, Feb 6, 2020 at 4:12 PM stefano <mailto:
>>> mangiolastefano using gmail.com> wrote:
>>> >> >
>>> >> > Thanks a lot for your comment Martin and Michael,
>>> >> >
>>> >> > Here I reply to Marti's comment. Michael I will try to implement
>>> your
>>> >> > solution!
>>> >> >
>>> >> > I think a key point from
>>> >> >
>>> https://github.com/Bioconductor/Contributions/issues/1355#issuecomment-580977106
>>> >> > (that I was under-looking) is
>>> >> >
>>> >> > *>> "So to sum up: if you submit a package to Bioconductor, there
>>> is an
>>> >> > expectation that your package can work seamlessly with other
>>> Bioconductor
>>> >> > packages, and your implementation should support that. The safest
>>> and
>>> >> > easiest way to do that is to use Bioconductor data structures"*
>>> >> >
>>> >> > In this case my package would not be suited as I do not use
>>> pre-existing
>>> >> > Bioconductor data structures, but instead i see value in using a
>>> simple
>>> >> > tibble, for the reasons in part explained in the README
>>> >> > https://github.com/stemangiola/ttBulk (harvesting the power of
>>> tidyverse
>>> >> > and friends for bulk transcriptomic analyses).
>>> >> >
>>> >> > *>> "with the minimum standard of being able to accept such objects
>>> even if
>>> >> > you do not rely on them internally (though you should)"*
>>> >> >
>>> >> > With this I can comply in the sense that I can built converters to
>>> and from
>>> >> > SummarizedExperiment (for example).
>>> >> >
>>> >> > * >> "If you don't want to do that, then that's a shame, but it
>>> would
>>> >> > suggest that Bioconductor would not be the right place to host this
>>> >> > package."*
>>> >> >
>>> >> > Well said.
>>> >> >
>>> >> > In summary, I do not rely on Bioconductor data structure, as I am
>>> proposing
>>> >> > another paradigm, but my back end is made of largely Bioconductor
>>> analysis
>>> >> > packages that I would like to interface with tidyverse. So
>>> >> >
>>> >> > 1) Should I build converters to Bioc. data structures, and force
>>> the use of
>>> >> > S3 object (needed to tiidyverse to work), or
>>> >> > 2) Submit to CRAN
>>> >> >
>>> >> > I don't have strong feeling for either, although I think
>>> Bioconductor would
>>> >> > be a good fit. Please community give me your honest opinions, I
>>> will take
>>> >> > them seriously and proceed.
>>> >> >
>>> >> >
>>> >> >
>>> >> > Best wishes.
>>> >> >
>>> >> > *Stefano *
>>> >> >
>>> >> >
>>> >> >
>>> >> > Stefano Mangiola | Postdoctoral fellow
>>> >> >
>>> >> > Papenfuss Laboratory
>>> >> >
>>> >> > The Walter Eliza Hall Institute of Medical Research
>>> >> >
>>> >> > +61 (0)466452544
>>> >> >
>>> >> >
>>> >> > Il giorno ven 7 feb 2020 alle ore 10:46 Martin Morgan <
>>> >> > mailto:mtmorgan.bioc using gmail.com> ha scritto:
>>> >> >
>>> >> > > The idea isn't to use S4 at any cost, but to 'play well' with the
>>> >> > > Bioconductor ecosystem, including writing robust and maintainable
>>> code.
>>> >> > >
>>> >> > > This comment
>>> >> > >
>>> https://github.com/Bioconductor/Contributions/issues/1355#issuecomment-580977106
>>> >> > > provides some motivation; there was also an interesting exchange
>>> on the
>>> >> > > Bioconductor community slack about this (join at
>>> >> > > https://bioc-community.herokuapp.com/; discussion starting with
>>> >> > >
>>> https://community-bioc.slack.com/archives/C35G93GJH/p1580144746014800).
>>> >> > > The plyranges package http://bioconductor.org/packages/plyranges
>>> and
>>> >> > > recently accepted fluentGenomics workflow
>>> >> > > https://github.com/Bioconductor/Contributions/issues/1350 provide
>>> >> > > illustrations.
>>> >> > >
>>> >> > > In your domain it's really surprising that your package does not
>>> use
>>> >> > > (Import or Depend on) SummarizedExperiment or GenomicRanges
>>> packages. From
>>> >> > > a superficial look at your package, it seems like something like
>>> >> > > `reduce_dimensions()` could be defined to take & return a
>>> >> > > SummarizedExperiment and hence benefit from some of the points in
>>> the
>>> >> > > github issue comment mentioned above.
>>> >> > >
>>> >> > > Certainly there is a useful transition, both 'on the way in' to a
>>> >> > > SummarizedExperiment, and after leaving the more specialized
>>> bioinformatic
>>> >> > > computations to, e.g., display a pairs plot of the reduced
>>> dimensions,
>>> >> > > where one might re-shape the data to a tidy format and use 'plain
>>> old'
>>> >> > > tibbles; the fluentGenomics workflow might provide some guidance.
>>> >> > >
>>> >> > > At the end of the day it would not be surprising for Bioconductor
>>> packages
>>> >> > > to make use of tidy concepts and data structures, particularly in
>>> the
>>> >> > > vignette, and it would be a mistake for Bioconductor to exclude
>>> >> > > well-motivated 'tidy' representations.
>>> >> > >
>>> >> > > Martin Morgan
>>> >> > >
>>> >> > > On 2/6/20, 5:46 PM, "Bioc-devel on behalf of stefano" <
>>> >> > > mailto:bioc-devel-bounces using r-project.org on behalf of mailto:
>>> mangiolastefano using gmail.com>
>>> >> > > wrote:
>>> >> > >
>>> >> > >     Hello,
>>> >> > >
>>> >> > >     I have a package (ttBulk) under review. I have been told to
>>> replace
>>> >> > > the S3
>>> >> > >     system to S4. My package is based on the class tbl_df and
>>> must be fully
>>> >> > >     compatible with tidyverse methods (inheritance). After some
>>> tests and
>>> >> > >     research I understood that tidyverse ecosystem is not
>>> compatible with
>>> >> > > S4
>>> >> > >     classes.
>>> >> > >
>>> >> > >      For example, several methos do not apparently handle S4
>>> objects based
>>> >> > > on
>>> >> > >     S3 tbl_df
>>> >> > >
>>> >> > >     ```library(tidyverse)setOldClass("tbl_df")
>>> >> > >     setClass("test2", contains = "tbl_df")
>>> >> > >     my <- new("test2",  tibble(a = 1))
>>> >> > >     my %>%  mutate(b = 3)
>>> >> > >
>>> >> > >        a b
>>> >> > >     1 1 3
>>> >> > >     ```
>>> >> > >
>>> >> > >      ```my <- new("test2",  tibble(a = rnorm(100), b = 1))
>>> >> > >     my %>% nest(data = -b)
>>> >> > >     Error: `x` must be a vector, not a `test2` object
>>> >> > >     Run `rlang::last_error()` to see where the error occurred.
>>> >> > >     ```
>>> >> > >
>>> >> > >     Could you please advise whether a tidyverse based package can
>>> be
>>> >> > > hosted on
>>> >> > >     Bioconductor, and if S4 classes are really mandatory? I need
>>> to
>>> >> > > understand
>>> >> > >     if I am forced to submit to CRAN instead (although
>>> Bioconductor would
>>> >> > > be a
>>> >> > >     good fit, sice I try to interface transcriptional analysis
>>> tools to
>>> >> > > tidy
>>> >> > >     universe)
>>> >> > >
>>> >> > >
>>> >> > >     Thanks a lot.
>>> >> > >     Stefano
>>> >> > >
>>> >> > >         [[alternative HTML version deleted]]
>>> >> > >
>>> >> > >     _______________________________________________
>>> >> > >     mailto:Bioc-devel using r-project.org mailing list
>>> >> > >     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >> > >
>>> >> > >
>>> >> >
>>> >> >         [[alternative HTML version deleted]]
>>> >> >
>>> >> > _______________________________________________
>>> >> > mailto:Bioc-devel using r-project.org mailing list
>>> >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Michael Lawrence
>>> >> Senior Scientist, Bioinformatics and Computational Biology
>>> >> Genentech, A Member of the Roche Group
>>> >> Office +1 (650) 225-7760
>>> >> mailto:michafla using gene.com
>>> >>
>>> >> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>>>
>>>
>>>
>>> --
>>> Michael Lawrence
>>> Senior Scientist, Bioinformatics and Computational Biology
>>> Genentech, A Member of the Roche Group
>>> Office +1 (650) 225-7760
>>> mailto:michafla using gene.com
>>>
>>> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>>> _______________________________________________
>>> Bioc-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> The information in this e-mail is intended only for the person to whom it
>> is
>> addressed. If you believe this e-mail was sent to you in error and the
>> e-mail
>> contains patient information, please contact the Partners Compliance HelpLine
>> at
>> http://www.partners.org/complianceline . If the e-mail was sent to you
>> in error
>> but does not contain patient information, please contact the sender and
>> properly
>> dispose of the e-mail.
>
>

-- 
The information in this e-mail is intended only for the person to whom it 
is
addressed. If you believe this e-mail was sent to you in error and the 
e-mail
contains patient information, please contact the Partners Compliance 
HelpLine at
http://www.partners.org/complianceline 
<http://www.partners.org/complianceline> . If the e-mail was sent to you in 
error
but does not contain patient information, please contact the sender 
and properly
dispose of the e-mail.

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list