[Bioc-devel] Changes to the SummarizedExperiment Class

Wed Mar 4 18:01:22 CET 2015

Some of the goals behind this discussion are, IMO, similar to those of
biocMultiAssay:

https://github.com/vjcitn/biocMultiAssay

Maybe Vince can confirm.

robert.

On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
> Oh, I don't disagree.  Perhaps the two problems can be addressed
> simultaneously by
>
> 1) deciding on what contracts a multi-assay container can/would demand to
> be useful
> 2) calling it something besides SummarizedExperiment, say,
> ExperimentCollection
>
> Then the SE API could stay the same as it is (which is already very useful)
> and progress could be sought in the offshoot (ExperimentCollection or
> whatever) without breaking things that rely on SE.
>
> Just off the top of my head, a most generically useful container for DNA
> methylation & CNV data (which can of course be called from the same assay)
> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
> eSet backwards compatibility (e.g. sampleNames(x) works, but
> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
> rowData(x)).  There are little niggles that I should probably just send in a
> patch for, but a cleaner overall container would be better, if for no other
> reason than the aforementioned ability to easily experiment with
> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
> somewhat less than optimal, especially when subsetting.
>
> But it does suggest that I could define a coercion from the current
> rambling wreck into a nice clean new class/API (ExperimentCollection or
> whatever) and I'll bet other package authors could, too.  The presence of a
> GRangesFrame would then be handy for returning a given assay's results, so
> that the user could be blissfully ignorant of the storage backing (ff,
> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
> advantages of a SummarizedExperiment.
>
> JMHO
>
> Statistics is the grammar of science.
> Karl Pearson<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>
> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey<stvjc at channing.harvard.edu>
> wrote:
>
>>   I am a bit concerned about any major alterations to the
>> SummarizedExperiment API.  We have two papers and plenty of working code
>> that use it in meaningful ways.  Effort required to keep new formulations
>> back-compatible as well as bug-free has to be weighed seriously.
>>
>>   I agree that the name is not ideal.  We are learning as we go.
>>
>>   Seems to make sense to start with the contracts we want the instances of
>> a class to satisfy.  I have long felt that the X[i, j] idiom is one users and
>> developers should be comfortable with, even insist on, and for consistency
>> with the matrix operations idiom, it should work in a natural way for numeric
>> indexing.  This seems like an important constraint.  subsetBy* is a useful
>> idiom, but it is conceivable that we would adopt filter() for row-oriented
>> selections and select() for column-oriented selections.  Do we have to make
>> any special design considerations to allow very smooth interoperation with
>> out-of-memory resources for certain components for developers who want to
>> allow this?
>>
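To make the X[i, j] contract concrete, here is a minimal sketch in base R
(methods package only). The class name TinyExperiment and its slots are
hypothetical and not part of any Bioconductor package; the only point is that
numeric i and j subset the assay, the row annotations and the column
annotations together.

```r
library(methods)

# Hypothetical toy container: one assay matrix plus row and column annotations.
setClass("TinyExperiment",
         representation(assay   = "matrix",
                        rowInfo = "data.frame",
                        colInfo = "data.frame"),
         validity = function(object) {
             if (nrow(object@assay) != nrow(object@rowInfo))
                 return("nrow(assay) must equal nrow(rowInfo)")
             if (ncol(object@assay) != nrow(object@colInfo))
                 return("ncol(assay) must equal nrow(colInfo)")
             TRUE
         })

setMethod("dim", "TinyExperiment", function(x) dim(x@assay))

# X[i, j]: rows of the assay and rowInfo follow i, columns and colInfo follow j.
setMethod("[", "TinyExperiment", function(x, i, j, ..., drop = FALSE) {
    if (missing(i)) i <- seq_len(nrow(x@assay))
    if (missing(j)) j <- seq_len(ncol(x@assay))
    new("TinyExperiment",
        assay   = x@assay[i, j, drop = FALSE],
        rowInfo = x@rowInfo[i, , drop = FALSE],
        colInfo = x@colInfo[j, , drop = FALSE])
})

te <- new("TinyExperiment",
          assay   = matrix(1:12, nrow = 4),
          rowInfo = data.frame(feature = paste0("f", 1:4),
                               stringsAsFactors = FALSE),
          colInfo = data.frame(sample = paste0("s", 1:3),
                               stringsAsFactors = FALSE))

sub <- te[2:3, c(1, 3)]
dim(sub)
```

Because every subset goes through new(), the validity function doubles as an
obligatory test that the three components stay in sync.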
>>   We should have a reasonable way to get data on what is out there, what
>> is used, how it is most effectively used.  What's the SE API?  Is it
>> well-adapted to requirements of DESeq2?  Other killer packages that
>> use/don't use it?  Even getting data on the formal API for a class is not
>> a familiar task.  And if folks are writing non-S4 interfaces (i.e., naked
>> functions) we have no way of identifying them.  See below for one way of
>> discovering the API for SummarizedExperiment.
>>
>>   In summary, I think we have to be careful about overdesigning too
>> early.  Getting clear on contracts seems the best way to ensure reuse, and
>> we really want that so that reliability is continually assessed.  My sense
>> is that it is good to give developers something they'll gladly extend, not
>> necessarily reuse directly.  So we don't have to have broad consensus on
>> class details, but on the minimal abstraction and on obligatory tests on
>> its basic implementation.
>>
>> > methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM
>>
>> DataFrame with 76 rows and 3 columns
>>          generic  signature                                                                  package
>>      <character>  <character>                                                                <character>
>> 1              [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                     base
>> 2              [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"                base
>> 3              [  x="SummarizedExperiment", i="ANY", j="missing"                             base
>> 4            [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"   base
>> 5          assay  x="SummarizedExperiment", i="character"                                    GenomicRanges
>> ...          ...  ...                                                                        ...
>> 72  updateObject  object="SummarizedExperiment"                                              BiocGenerics
>> 73        values  x="SummarizedExperiment"                                                   S4Vectors
>> 74      values<-  x="SummarizedExperiment"                                                   S4Vectors
>> 75         width  x="SummarizedExperiment"                                                   BiocGenerics
>> 76       width<-  x="SummarizedExperiment"                                                   BiocGenerics
>>
>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo<hcorrada at gmail.com>
>> wrote:
>>
>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
>>> return whatever makes sense (GRanges, or other data structures; thinking
>>> of taxonomy for metagenomics, for example). GRangesFrame can inherit from
>>> this.
>>>
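A rough sketch of what such an indexed frame could look like, using only the
methods package. The slot names and the validity rule are assumptions made for
illustration; only the names IndexedFrame and rowIndices come from the
suggestion above, and neither exists as a real class or generic here.

```r
library(methods)

# Hypothetical class: a table plus a per-row index of any type that
# supports length() (positions, taxonomy labels, ...).
setClass("IndexedFrame",
         representation(tab = "data.frame", index = "ANY"),
         validity = function(object) {
             if (length(object@index) != nrow(object@tab))
                 return("index must have one element per row of tab")
             TRUE
         })

# Hypothetical accessor named after the suggestion in the thread.
setGeneric("rowIndices", function(x, ...) standardGeneric("rowIndices"))
setMethod("rowIndices", "IndexedFrame", function(x, ...) x@index)

# A metagenomics-flavoured example: the index is a vector of taxon labels.
taxa <- new("IndexedFrame",
            tab   = data.frame(count = c(12L, 7L, 30L)),
            index = c("Bacteroides", "Prevotella", "Escherichia"))
rowIndices(taxa)
```

A range-based variant would keep the same class and put a ranges object in the
index slot, which is roughly the inheritance relationship proposed above.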
>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès<hpages at fredhutch.org>  wrote:
>>>
>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>
>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>
>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>
>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>                   some accessor (e.g. rowRanges())
>>>>
>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>> can hold, but different in terms of API: the former has the ranges
>>>> API as primary API and the DataFrame API on its mcols() component,
>>>> and the latter has the DataFrame API as primary API and the ranges
>>>> API on its rowRanges() component. Nice switch!
>>>>
>>>> What does this API switch bring us? A GRangesFrame object is now
>>>> an object that fully behaves like a DataFrame and people can also
>>>> perform range-based operations on its rowRanges() component.
>>>> Here is what I'm afraid is going to happen: people will also want
>>>> to be able to perform range-based operations *directly* on
>>>> these objects, i.e. without having to call rowRanges() first.
>>>> So for example when they do subsetByOverlaps(), subsetting
>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>> would contain row indices. Problem with this is that these objects
>>>> now start to suffer from the "dual personality syndrome". For
>>>> example, it's not clear anymore what their length should be.
>>>> Strictly speaking it should be their number of columns (that's
>>>> what the length of a DataFrame is), but the ranges API that
>>>> we're trying to put on them also makes them feel like vectors
>>>> along the vertical dimension so it also feels that their length
>>>> should be their number of rows. Same thing with 1D subsetting.
>>>> Why does it subset the columns and not the rows? Most people
>>>> are now confused.
>>>>
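The length and 1D-subsetting ambiguity can be seen with a plain base R
data.frame, which follows the same list-of-columns convention described above
for DataFrame. A small illustration (the object names are made up, and the
integer vector merely stands in for a ranges-like object):

```r
# A data.frame is a list of columns, so length() and 1D "[" act on columns.
df <- data.frame(start = c(1L, 10L, 20L, 30L),
                 end   = c(5L, 15L, 25L, 35L),
                 score = c(0.1, 0.2, 0.3, 0.4))

length(df)     # number of columns
nrow(df)       # number of rows
names(df[2])   # 1D subsetting picks a column

# A vector-like view of the same records (here just the start positions)
# has one element per row, so length() and 1D "[" act on rows instead.
starts <- df$start
length(starts)
starts[2]
```

An object that tries to offer both APIs at once has to pick one of these two
answers for length() and x[i], which is the confusion described above.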
>>>> It's interesting to note that the same thing happens with GRanges
>>>> objects, but in the opposite direction: people wish they could
>>>> do DataFrame operations directly on them without calling mcols()
>>>> first. But in order to preserve the good health of GRanges objects,
>>>> we've not done that (except for $, a shortcut for mcols(x)$; the
>>>> pressure was just too strong).
>>>>
>>>> H.
>>>>
>>>>
>>>>
>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>
>>>>> Should be possible for the annotations to be of any type, as long as they
>>>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>>>> special class for the container with range information. The contract for
>>>>> the range annotation would be to have a granges() method.
>>>>>
>>>>> I agree it would be nice if there was a way with the methods package to
>>>>> easily assert such contracts. For example, one could define an interface
>>>>> with a set of generics (and optionally the relevant position in the
>>>>> generic signature). Then, once all of the methods have been assigned for
>>>>> a particular class, it is made to inherit from that contract class. There
>>>>> are lots of gotchas though. Not sure how useful it would be in practice.
>>>>>
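One way such a contract check might look with the methods package as it
stands, as a sketch only. The helper implementsContract(), the two toy classes
and the local granges generic are all hypothetical stand-ins; "dim" is used in
place of NROW() because base NROW() falls back on dim(). The check looks only
at methods defined directly on the class, which is one of the gotchas:
inherited methods and default methods are ignored.

```r
library(methods)

# Hypothetical helper: TRUE when class `cls` directly defines an S4 method
# for every generic named in `generics` (first-argument dispatch only).
implementsContract <- function(cls, generics) {
    all(vapply(generics, function(f) existsMethod(f, cls), logical(1)))
}

# Stand-in generic for this sketch only; not the real granges() generic.
setGeneric("granges", function(x, ...) standardGeneric("granges"))

# Two toy annotation classes: one plain, one that also exposes ranges.
setClass("PlainAnnot",  representation(tab = "data.frame"))
setClass("RangedAnnot", representation(tab = "data.frame"))

for (cls in c("PlainAnnot", "RangedAnnot")) {
    setMethod("dim", cls, function(x) dim(x@tab))
    setMethod("[", cls, function(x, i, j, ..., drop = FALSE) {
        if (missing(i)) i <- seq_len(nrow(x@tab))
        if (missing(j)) j <- seq_len(ncol(x@tab))
        x@tab <- x@tab[i, j, drop = FALSE]
        x
    })
}
setMethod("granges", "RangedAnnot",
          function(x, ...) x@tab[, c("start", "end"), drop = FALSE])

annotationContract <- c("dim", "[")
rangeContract      <- c(annotationContract, "granges")

implementsContract("PlainAnnot",  annotationContract)
implementsContract("PlainAnnot",  rangeContract)
implementsContract("RangedAnnot", rangeContract)
```

Only RangedAnnot satisfies the range contract here, since PlainAnnot has no
granges method. Making a class inherit from a contract class automatically
once the check passes is left out of the sketch.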
>>>>>
>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty<haverty.peter at gene.com>
>>>>> wrote:
>>>>>
>>>>>> There are some nice similarities in these new imaginary types.  A
>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>>>> some row meta-data (the GRanges).  The SE-like object is similarly a list
>>>>>> of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
>>>>>> HDF5-backed things) with some row meta-data (a DataFrame or
>>>>>> GRangesFrame).
>>>>>> Elegant?  Maybe they would actually be relatives in the class tree.
>>>>>>
>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>>> implements this set of methods ...
>>>>>>
>>>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>>>> an extension to this new SE-like thing.  The extra stuff that comes along
>>>>>> with genoset will still be available.
>>>>>>
>>>>>> Pete
>>>>>>
>>>>>> ____________________
>>>>>> Peter M. Haverty, Ph.D.
>>>>>> Genentech, Inc.
>>>>>> phaverty at gene.com
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr.<tim.triche at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This.
>>>>>>>
>>>>>>> It would be damned near perfect as a return value for assays coming out
>>>>>>> of an object that held several such assays at several time points in a
>>>>>>> population, where there are both assay-wise and covariate-wise "holes"
>>>>>>> that could nonetheless be usefully imputed across assays.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty<haverty.peter at gene.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>> argument.
>>>>>>>>>
>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>> conflicting APIs.
>>>>>>>>
>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>> (without mcols) as an index?
>>>>>>>>
>>>>>>>>
>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fredhutch.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>
>>
>>
>

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550