[Bioc-devel] Changes to the SummarizedExperiment Class

Tim Triche, Jr. tim.triche at gmail.com
Wed Mar 4 18:33:18 CET 2015


so I'm told:

https://github.com/vjcitn/biocMultiAssay/blob/master/R/triche.R



Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Mar 4, 2015 at 9:01 AM, Robert Castelo <robert.castelo at upf.edu>
wrote:

> some of the goals behind this discussion are IMO similar to the ones for
> biocMultiAssay:
>
> https://github.com/vjcitn/biocMultiAssay
>
> maybe Vince can confirm.
>
> robert.
>
> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
> > Oh, I don't disagree.  Perhaps the two problems can be addressed
> > simultaneously by
> >
> > 1) deciding on what contracts a multi-assay container can/would demand to
> > be useful
> > 2) calling it something besides SummarizedExperiment, say,
> > ExperimentCollection
> >
> > Then the SE API could stay the same as it is (which is already very useful)
> > and progress could be sought in the offshoot (ExperimentCollection or
> > whatever) without breaking things that rely on SE.
> >
> > Just off the top of my head, a most generically useful container for DNA
> > methylation & CNV data (which can of course be called from the same assay)
> > is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
> > eSet backwards compatibility.  (e.g. sampleNames(x) works, but
> > sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
> > rowData(x))  There are little niggles that I should probably just send in a
> > patch for, but a cleaner overall container would be better, if for no other
> > reason than the aforementioned ability to easily experiment with
> > imputation. An approach that I've been using is to stuff the SNPs, CNV (as
> > GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
> > somewhat less than optimal, especially when subsetting.
> >
> > But it does suggest that I could define a coercion from the current
> > rambling wreck into a nice clean new class/API (ExperimentCollection or
> > whatever) and I'll bet other package authors could, too.  The presence of a
> > GRangesFrame would then be handy for returning a given assay's results, so
> > that the user could be blissfully ignorant of the storage backing (ff,
> > BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
> > advantages of a SummarizedExperiment.
> >
> > JMHO
> >
> > On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey <stvjc at channing.harvard.edu>
> > wrote:
> >
> >>   I am a bit concerned about any major alterations to the
> >> SummarizedExperiment API.  We have
> >> two papers and plenty of working code that use it in meaningful ways.
> >> Effort required to keep new
> >> formulations back-compatible as well as bug-free has to be weighed
> >> seriously.
> >>
> >>   I agree that the name is not ideal.  We are learning as we go.
> >>
> >>   Seems to make sense to start with the contracts we want the instances of
> >> a class to satisfy.  I have long felt that X[i, j] idiom is one users and
> >> developers should be comfortable with, even insist on, and for consistency
> >> with matrix operations idiom, it should work in a natural way for numeric
> >> indexing.  This seems like an important constraint.  subsetBy* is a useful
> >> idiom, but it is conceivable that we would adopt filter() for row-oriented
> >> selections and select() for column-oriented selections.  Do we have to make
> >> any special design considerations to allow very smooth interoperation with
> >> out-of-memory resources for certain components for developers who want to
> >> allow this?
> >>
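A minimal sketch of the X[i, j] contract described above, using only base R and the methods package. ToyCollection and its slot names are hypothetical, chosen for illustration; this is not an existing Bioconductor class, and a real container would add validity checks and support for out-of-memory backends.

```r
library(methods)

# Hypothetical toy container: several assays sharing row and column annotations.
setClass("ToyCollection", representation(
  assays  = "list",        # matrices with identical dimensions
  rowData = "data.frame",  # one row per feature
  colData = "data.frame"   # one row per sample
))

# 2D subsetting applies the same i and j to every assay and keeps the
# annotations in sync; the drop argument is accepted but ignored.
setMethod("[", "ToyCollection", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(nrow(x@rowData))
  if (missing(j)) j <- seq_len(nrow(x@colData))
  new("ToyCollection",
      assays  = lapply(x@assays, function(m) m[i, j, drop = FALSE]),
      rowData = x@rowData[i, , drop = FALSE],
      colData = x@colData[j, , drop = FALSE])
})

x <- new("ToyCollection",
         assays  = list(expr = matrix(1:6, nrow = 3),
                        meth = matrix(6:1, nrow = 3)),
         rowData = data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE),
         colData = data.frame(sample = c("s1", "s2"), stringsAsFactors = FALSE))

y <- x[2:3, 1]   # rows 2-3 and column 1 of every assay
dim(y@assays$expr)
y@rowData$id
```

Because every assay and both annotation tables go through the same i and j, numeric indexing behaves as it would for a plain matrix.
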
> >>   We should have a reasonable way to get data on what is out there, what
> >> is used, how it is most effectively used.  What's the SE API?  Is it
> >> well-adapted to requirements of DESeq2?  Other killer packages that
> >> use/don't use it?  Even getting data on the formal API for a class is not
> >> all that familiar.  And if folks are writing non-S4 interfaces (i.e., naked
> >> functions) we have no way of identifying them.  See below for one way of
> >> discovering the API for SummarizedExperiment.
> >>
> >>   In summary, I think we have to be careful about overdesigning too
> >> early.  Getting clear on contracts seems the best way to ensure reuse, and
> >> we really want that so that reliability is continually assessed.  My sense
> >> is that it is good to give developers something they'll gladly extend, not
> >> necessarily reuse directly.  So we don't have to have broad consensus on
> >> class details, but on the minimal abstraction and on obligatory tests on
> >> its basic implementation.
> >>
> >> > # perhaps an obsolete version of methods cataloguer by MTM
> >> > methods(class="SummarizedExperiment")
> >>
> >> DataFrame with 76 rows and 3 columns
> >>
> >>     generic      signature                                                                 package
> >>     <character>  <character>                                                               <character>
> >> 1   [            x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
> >> 2   [            x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
> >> 3   [            x="SummarizedExperiment", i="ANY", j="missing"                            base
> >> 4   [<-          x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
> >> 5   assay        x="SummarizedExperiment", i="character"                                   GenomicRanges
> >> ... ...          ...                                                                       ...
> >> 72  updateObject object="SummarizedExperiment"                                             BiocGenerics
> >> 73  values       x="SummarizedExperiment"                                                  S4Vectors
> >> 74  values<-     x="SummarizedExperiment"                                                  S4Vectors
> >> 75  width        x="SummarizedExperiment"                                                  BiocGenerics
> >> 76  width<-      x="SummarizedExperiment"                                                  BiocGenerics
> >>
> >> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorrada at gmail.com>
> >> wrote:
> >>
> >>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
> >>> return whatever makes sense (GRanges, or other data structures -thinking
> >>> taxonomy for metagenomics for example-). GRangesFrame can inherit from
> >>> this.
> >>>
> >>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> >>>
> >>>> GRangesFrame is an interesting idea and I gave it some thoughts.
> >>>>
> >>>> There is this nice symmetry between GRanges and GRangesFrame:
> >>>>
> >>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
> >>>>
> >>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
> >>>>                   some accessor (e.g. rowRanges())
> >>>>
> >>>> So GRanges and GRangesFrame are equivalent in terms of what they
> >>>> can hold, but different in terms of API: the former has the ranges
> >>>> API as primary API and the DataFrame API on its mcols() component,
> >>>> and the latter has the DataFrame API as primary API and the ranges
> >>>> API on its rowRanges() component. Nice switch!
> >>>>
> >>>> What does this API switch bring us? A GRangesFrame object is now
> >>>> an object that fully behaves like a DataFrame and people can also
> >>>> perform range-based operations on its rowRanges() component.
> >>>> Here is what I'm afraid is going to happen: people will also want
> >>>> to be able to perform range-based operations *directly* on
> >>>> these objects, i.e. without having to call rowRanges() first.
> >>>> So for example when they do subsetByOverlaps(), subsetting
> >>>> happens vertically. Also the Hits object returned by findOverlaps()
> >>>> would contain row indices. Problem with this is that these objects
> >>>> now start to suffer from the "dual personality syndrome". For
> >>>> example, it's not clear anymore what their length should be.
> >>>> Strictly speaking it should be their number of columns (that's
> >>>> what the length of a DataFrame is), but the ranges API that
> >>>> we're trying to put on them also makes them feel like vectors
> >>>> along the vertical dimension so it also feels that their length
> >>>> should be their number of rows. Same thing with 1D subsetting.
> >>>> Why does it subset the columns and not the rows? Most people
> >>>> are now confused.
> >>>>
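The length ambiguity can be seen with a plain data.frame, used here only as a base-R stand-in for DataFrame (an assumption made so the sketch runs without Bioconductor installed). The column names are made up for illustration.

```r
# Stand-in for the DataFrame part of a hypothetical GRangesFrame.
df <- data.frame(score = c(0.1, 0.5, 0.9),
                 name  = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

length(df)    # counts columns, not rows
nrow(df)      # the "vertical" length a ranges API would expect
names(df[1])  # 1D subsetting selects a column, not a row
```

An object that answers to both APIs has to pick one of these two conventions for length() and 1D "[", which is the conflict described above.
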
> >>>> It's interesting to note that the same thing happens with GRanges
> >>>> objects, but in the opposite direction: people wish they could
> >>>> do DataFrame operations directly on them without calling mcols()
> >>>> first. But in order to preserve the good health of GRanges objects,
> >>>> we've not done that (except for $, a shortcut for mcols(x)$,
> >>>> the pressure was just too strong).
> >>>>
> >>>> H.
> >>>>
> >>>>
> >>>>
> >>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
> >>>>
> >>>>> Should be possible for the annotations to be of any type, as long as they
> >>>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
> >>>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
> >>>>> special class for the container with range information. The contract for
> >>>>> the range annotation would be to have a granges() method.
> >>>>>
> >>>>> I agree it would be nice if there was a way with the methods package to
> >>>>> easily assert such contracts. For example, one could define an interface
> >>>>> with a set of generics (and optionally the relevant position in the
> >>>>> generic signature). Then, once all of the methods have been assigned for a
> >>>>> particular class, it is made to inherit from that contract class. There
> >>>>> are lots of gotchas though. Not sure how useful it would be in practice.
> >>>>>
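One way such a contract check might look with the methods package as it stands. This is a sketch under assumptions: nRows and subset2d are hypothetical generics standing in for NROW() and 2D "[", the Toy* classes and implementsContract() are invented for illustration, and the check only tests whether methods exist, not whether they behave correctly.

```r
library(methods)

# Hypothetical generics standing in for the NROW() and 2D "[" contract.
setGeneric("nRows", function(x) standardGeneric("nRows"))
setGeneric("subset2d", function(x, i, j) standardGeneric("subset2d"),
           signature = "x")

# A class that defines both methods ...
setClass("ToyAnnotation", representation(data = "data.frame"))
setMethod("nRows", "ToyAnnotation", function(x) nrow(x@data))
setMethod("subset2d", "ToyAnnotation", function(x, i, j) {
  new("ToyAnnotation", data = x@data[i, j, drop = FALSE])
})

# ... and one that defines only the first.
setClass("ToyIncomplete", representation(data = "list"))
setMethod("nRows", "ToyIncomplete", function(x) length(x@data))

# TRUE if `cls` has a directly defined method for every generic in `generics`.
implementsContract <- function(cls, generics) {
  all(vapply(generics, function(g) existsMethod(g, cls), logical(1)))
}

contract <- c("nRows", "subset2d")
implementsContract("ToyAnnotation", contract)
implementsContract("ToyIncomplete", contract)
```

existsMethod() only sees methods defined directly on the class; hasMethod() with inherited = TRUE would also accept methods inherited from a superclass, which is one of the gotchas a real interface mechanism would have to settle.
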
> >>>>>
> >>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
> >>>>> wrote:
> >>>>>
> >>>>>> There are some nice similarities in these new imaginary types.  A
> >>>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
> >>>>>> some row meta-data (the GRanges).  The SE-like object is similarly a list
> >>>>>> of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
> >>>>>> HDF5-backed things) with some row meta-data (a DataFrame or
> >>>>>> GRangesFrame).  Elegant?  Maybe they would actually be relatives in the
> >>>>>> class tree.
> >>>>>>
> >>>>>> I wonder if this kind of thing would be easier if we had Java-style
> >>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
> >>>>>> implements this set of methods ...
> >>>>>>
> >>>>>> Oh, and kinda apropos, the genoset class will probably go away or become
> >>>>>> an extension to this new SE-like thing.  The extra stuff that comes along
> >>>>>> with genoset will still be available.
> >>>>>>
> >>>>>> Pete
> >>>>>>
> >>>>>> ____________________
> >>>>>> Peter M. Haverty, Ph.D.
> >>>>>> Genentech, Inc.
> >>>>>> phaverty at gene.com
> >>>>>>
> >>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> This.
> >>>>>>>
> >>>>>>> It would be damned near perfect as a return value for assays coming out
> >>>>>>> of an object that held several such assays at several time points in a
> >>>>>>> population, where there are both assay-wise and covariate-wise "holes"
> >>>>>>> that could nonetheless be usefully imputed across assays.
> >>>>>>>
> >>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
> >>>>>>>>>> which would make this easy, but I don't seem to be winning that
> >>>>>>>>>> argument.
> >>>>>>>>>
> >>>>>>>>> Just impossible. As Michael mentioned back in November, they have
> >>>>>>>>> conflicting APIs.
> >>>>>>>>
> >>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
> >>>>>>>> (without mcols) as an index?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>           [[alternative HTML version deleted]]
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Bioc-devel at r-project.org mailing list
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>
> >>>> --
> >>>> Hervé Pagès
> >>>>
> >>>> Program in Computational Biology
> >>>> Division of Public Health Sciences
> >>>> Fred Hutchinson Cancer Research Center
> >>>> 1100 Fairview Ave. N, M1-B514
> >>>> P.O. Box 19024
> >>>> Seattle, WA 98109-1024
> >>>>
> >>>> E-mail: hpages at fredhutch.org
> >>>> Phone:  (206) 667-5791
> >>>> Fax:    (206) 667-1319
>
> --
> Robert Castelo, PhD
> Associate Professor
> Dept. of Experimental and Health Sciences
> Universitat Pompeu Fabra (UPF)
> Barcelona Biomedical Research Park (PRBB)
> Dr Aiguader 88
> E-08003 Barcelona, Spain
> telf: +34.933.160.514
> fax: +34.933.160.550
>
