[Bioc-devel] Changes to the SummarizedExperiment Class

Tim Triche, Jr. tim.triche at gmail.com
Wed Mar 4 18:32:37 CET 2015


My response was meant to address this:

1) fixed-dimension, fixed sample set is a solved problem, and SE is that
solution.
2) multi-assay, "holes" across samples remains an ugly thorny problem,
maybe needs a new API

So why not keep SE as stable as possible, and dump all the explosive
changes into the latter?
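
To make the distinction concrete, here is a rough base-R sketch (S4, methods package only). The class name FixedAssays and its single slot are made up for illustration and are not how SummarizedExperiment is implemented. The point is only that a validity check can enforce case 1, and that case 2 fails it by construction, which is why it probably wants its own class.

```r
library(methods)

# Hypothetical toy container: a list of matrices that must all share
# the same dimensions and the same column (sample) names.
setClass("FixedAssays",
         representation(assays = "list"),
         validity = function(object) {
           a <- object@assays
           if (length(a) == 0L) return(TRUE)
           same_dim <- vapply(a, function(m) identical(dim(m), dim(a[[1]])),
                              logical(1))
           if (!all(same_dim))
             return("all assays must have identical dimensions")
           same_cols <- vapply(a, function(m)
             identical(colnames(m), colnames(a[[1]])), logical(1))
           if (!all(same_cols))
             return("all assays must share the same sample names")
           TRUE
         })

samples <- c("s1", "s2", "s3")
expr <- matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), samples))
meth <- matrix(runif(6), nrow = 2, dimnames = list(c("g1", "g2"), samples))

# Case 1: fixed dimensions, fixed sample set -> accepted
ok <- new("FixedAssays", assays = list(expr = expr, meth = meth))

# Case 2: one assay lacks a sample (a "hole") -> rejected by validity
holey <- tryCatch(
  new("FixedAssays", assays = list(expr = expr, meth = meth[, -3])),
  error = function(e) conditionMessage(e))
```

Here `holey` ends up holding the validity error message rather than an object.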


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

>
>
> On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo <robert.castelo at upf.edu>
> wrote:
>
>> some of the goals behind this discussion are IMO similar to the ones for
>> biocMultiAssay:
>>
>> https://github.com/vjcitn/biocMultiAssay
>>
>> maybe Vince can confirm.
>>
>
>
> It is true that there are connections between the concerns.  But the way I
> see it, the container design we
> are talking about in this thread addresses the management of a fixed
> common assay type over a fixed set of samples.
>
> The biocMultiAssay deals with the management of multiple assay types over
> multiple samples, with possible
> disparities in sample sets over the different assay types.
>
>
>
>> robert.
>>
>> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>>
>>> Oh, I don't disagree.  Perhaps the two problems can be addressed
>>> simultaneously by
>>>
>>> 1) deciding on what contracts a multi-assay container can/would demand to
>>>    be useful
>>> 2) calling it something besides SummarizedExperiment, say,
>>>    ExperimentCollection
>>>
>>> Then the SE API could stay the same as it is (which is already very useful)
>>> and progress could be sought in the offshoot (ExperimentCollection or
>>> whatever) without breaking things that rely on SE.
>>>
>>> Just off the top of my head, a most generically useful container for DNA
>>> methylation & CNV data (which can of course be called from the same assay)
>>> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
>>> eSet backwards compatibility (e.g. sampleNames(x) works, but
>>> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
>>> rowData(x)).  There are little niggles that I should probably just send in
>>> a patch for, but a cleaner overall container would be better, if for no
>>> other reason than the aforementioned ability to easily experiment with
>>> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
>>> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
>>> somewhat less than optimal, especially when subsetting.
>>>
>>> But it does suggest that I could define a coercion from the current
>>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>>> whatever) and I'll bet other package authors could, too.  The presence of a
>>> GRangesFrame would then be handy for returning a given assay's results, so
>>> that the user could be blissfully ignorant of the storage backing (ff,
>>> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
>>> advantages of a SummarizedExperiment.
>>>
>>> JMHO
>>>
>>>
>>>
>>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey <stvjc at channing.harvard.edu>
>>> wrote:
>>>
>>>> I am a bit concerned about any major alterations to the
>>>> SummarizedExperiment API.  We have two papers and plenty of working code
>>>> that use it in meaningful ways.  Effort required to keep new formulations
>>>> back-compatible as well as bug-free has to be weighed seriously.
>>>>
>>>> I agree that the name is not ideal.  We are learning as we go.
>>>>
>>>> Seems to make sense to start with the contracts we want the instances of
>>>> a class to satisfy.  I have long felt that X[i, j] idiom is one users and
>>>> developers should be comfortable with, even insist on, and for
>>>> consistency with matrix operations idiom, it should work in a natural way
>>>> for numeric indexing.  This seems like an important constraint.
>>>> subsetBy* is a useful idiom, but it is conceivable that we would adopt
>>>> filter() for row-oriented selections and select() for column-oriented
>>>> selections.  Do we have to make any special design considerations to
>>>> allow very smooth interoperation with out-of-memory resources for certain
>>>> components for developers who want to allow this?
>>>>
>>>> We should have a reasonable way to get data on what is out there, what is
>>>> used, how it is most effectively used.  What's the SE API?  Is it
>>>> well-adapted to requirements of DESeq2?  Other killer packages that
>>>> use/don't use it?  Even getting data on the formal API for a class is not
>>>> all that familiar.  And if folks are writing non-S4 interfaces (i.e.,
>>>> naked functions) we have no way of identifying them.  See below for one
>>>> way of discovering the API for SummarizedExperiment.
>>>>
>>>> In summary, I think we have to be careful about overdesigning too early.
>>>> Getting clear on contracts seems the best way to ensure reuse, and we
>>>> really want that so that reliability is continually assessed.  My sense
>>>> is that it is good to give developers something they'll gladly extend,
>>>> not necessarily reuse directly.  So we don't have to have broad consensus
>>>> on class details, but on the minimal abstraction and on obligatory tests
>>>> on its basic implementation.
>>>>
>>>> methods(class="SummarizedExperiment")  # perhaps an obsolete version of
>>>>                                        # methods cataloguer by MTM
>>>>
>>>> DataFrame with 76 rows and 3 columns
>>>>          generic  signature                                                                 package
>>>>      <character>  <character>                                                               <character>
>>>> 1              [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
>>>> 2              [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
>>>> 3              [  x="SummarizedExperiment", i="ANY", j="missing"                            base
>>>> 4            [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
>>>> 5          assay  x="SummarizedExperiment", i="character"                                   GenomicRanges
>>>> ...          ...  ...                                                                       ...
>>>> 72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
>>>> 73        values  x="SummarizedExperiment"                                                  S4Vectors
>>>> 74      values<-  x="SummarizedExperiment"                                                  S4Vectors
>>>> 75         width  x="SummarizedExperiment"                                                  BiocGenerics
>>>> 76       width<-  x="SummarizedExperiment"                                                  BiocGenerics
>>>>
>>>>
>>>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorrada at gmail.com>
>>>> wrote:
>>>>
>>>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
>>>>> return whatever makes sense (GRanges, or other data structures; thinking
>>>>> taxonomy for metagenomics, for example). GRangesFrame can inherit from
>>>>> this.
>>>>>
>>>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpages at fredhutch.org>
>>>>> wrote:
>>>>>
>>>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>>>
>>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>>
>>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>>
>>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>>                   some accessor (e.g. rowRanges())
>>>>>>
>>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>>> can hold, but different in terms of API: the former has the ranges
>>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>>> API on its rowRanges() component. Nice switch!
>>>>>>
>>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>>> an object that fully behaves like a DataFrame and people can also
>>>>>> perform range-based operations on its rowRanges() component.
>>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>>> to be able to perform range-based operations *directly* on
>>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>>> would contain row indices. Problem with this is that these objects
>>>>>> now start to suffer from the "dual personality syndrome". For
>>>>>> example, it's not clear anymore what their length should be.
>>>>>> Strictly speaking it should be their number of columns (that's
>>>>>> what the length of a DataFrame is), but the ranges API that
>>>>>> we're trying to put on them also makes them feel like vectors
>>>>>> along the vertical dimension so it also feels that their length
>>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>>> Why does it subset the columns and not the rows? Most people
>>>>>> are now confused.
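
The DataFrame half of this is easy to check with a plain base-R data.frame, used here only as a stand-in on the assumption that DataFrame follows the same list-like conventions for length() and 1D "[" that the paragraph above describes. A small sketch:

```r
# Plain data.frame as a stand-in for DataFrame: 4 rows, 2 columns
df <- data.frame(score  = c(0.1, 0.5, 0.9, 0.3),
                 strand = c("+", "-", "+", "-"))

n_len  <- length(df)       # counts columns, not rows
n_rows <- NROW(df)         # counts rows
first  <- names(df[1])     # 1D subsetting selects a column
top2   <- dim(df[1:2, ])   # rows need the 2D form
```

A ranges-style API on the same object would want length() and x[i] to run along the rows instead, which is exactly the conflict.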
>>>>>>
>>>>>> It's interesting to note that the same thing happens with GRanges
>>>>>> objects, but in the opposite direction: people wish they could
>>>>>> do DataFrame operations directly on them without calling mcols()
>>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>>> the pressure was just too strong).
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>> Should be possible for the annotations to be of any type, as long as they
>>>>>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>>>>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>>>>>> special class for the container with range information. The contract for
>>>>>>> the range annotation would be to have a granges() method.
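
A contract like this can at least be probed at run time. The helper below is hypothetical (satisfies_annotation_contract is not an existing function anywhere) and uses only base R; it simply checks that NROW() and 2D "[" behave consistently on a non-empty candidate object. It is a sketch of the idea, not a proposal for how the methods package should do it.

```r
# Hypothetical helper: does `x` support NROW() and 2D "[" row subsetting?
# Empty objects are treated as failing, since there is no row to probe.
satisfies_annotation_contract <- function(x) {
  n <- tryCatch(NROW(x), error = function(e) NA_integer_)
  if (is.na(n) || n < 1L) return(FALSE)
  sub <- tryCatch(x[1L, , drop = FALSE], error = function(e) NULL)
  !is.null(sub) && NROW(sub) == 1L
}

df_ok  <- satisfies_annotation_contract(data.frame(a = 1:3, b = letters[1:3]))
mat_ok <- satisfies_annotation_contract(matrix(1:6, nrow = 3))
vec_ok <- satisfies_annotation_contract(1:3)  # atomic vector has no 2D "["
```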
>>>>>>>
>>>>>>> I agree it would be nice if there was a way with the methods package to
>>>>>>> easily assert such contracts. For example, one could define an interface
>>>>>>> with a set of generics (and optionally the relevant position in the
>>>>>>> generic signature). Then, once all of the methods have been assigned for
>>>>>>> a particular class, it is made to inherit from that contract class.
>>>>>>> There are lots of gotchas though. Not sure how useful it would be in
>>>>>>> practice.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There are some nice similarities in these new imaginary types.  A
>>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>>>>>> some row meta-data (the GRanges).  The SE-like object is similarly a list
>>>>>>>> of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
>>>>>>>> HDF5-backed things) with some row meta-data (a DataFrame or
>>>>>>>> GRangesFrame).  Elegant?  Maybe they would actually be relatives in the
>>>>>>>> class tree.
>>>>>>>>
>>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>>>>> implements this set of methods ...
>>>>>>>>
>>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>>>>>> an extension to this new SE-like thing.  The extra stuff that comes along
>>>>>>>> with genoset will still be available.
>>>>>>>>
>>>>>>>> Pete
>>>>>>>>
>>>>>>>> ____________________
>>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>>> Genentech, Inc.
>>>>>>>> phaverty at gene.com
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This.
>>>>>>>>>
>>>>>>>>> It would be damned near perfect as a return value for assays coming out
>>>>>>>>> of an object that held several such assays at several time points in a
>>>>>>>>> population, where there are both assay-wise and covariate-wise "holes"
>>>>>>>>> that could nonetheless be usefully imputed across assays.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>>>> argument.
>>>>>>>>>>>
>>>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>>>> conflicting APIs.
>>>>>>>>>>
>>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>>> (without mcols) as an index?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fredhutch.org
>>>>>> Phone:  (206) 667-5791
>>>>>> Fax:    (206) 667-1319
>>>>>>
>>
>> --
>> Robert Castelo, PhD
>> Associate Professor
>> Dept. of Experimental and Health Sciences
>> Universitat Pompeu Fabra (UPF)
>> Barcelona Biomedical Research Park (PRBB)
>> Dr Aiguader 88
>> E-08003 Barcelona, Spain
>> telf: +34.933.160.514
>> fax: +34.933.160.550
>>
>
>
