[Bioc-devel] Changes to the SummarizedExperiment Class

Vincent Carey stvjc at channing.harvard.edu
Wed Mar 4 18:12:58 CET 2015


On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:

> some of the goals behind this discussion are IMO similar to the ones for
> biocMultiAssay:
>
> https://github.com/vjcitn/biocMultiAssay
>
> maybe Vince can confirm.
>


It is true that there are connections between the concerns.  But the way I
see it, the container design we are talking about in this thread addresses
the management of a fixed common assay type over a fixed set of samples.

biocMultiAssay, by contrast, deals with the management of multiple assay
types over multiple samples, with possible disparities in sample sets over
the different assay types.
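
A toy sketch of the difference, in base R only.  It does not use the
SummarizedExperiment or biocMultiAssay classes; the matrices, the sample
names and the helper commonSamples() are all made up for illustration, so
read it as a picture of the two situations rather than as either API.

```r
# Hypothetical toy data: base R only, no Bioconductor classes involved.
set.seed(1)

# Case 1: one assay type over a fixed set of samples.
# A single features-by-samples matrix is enough, and X[i, j] is unambiguous.
expr <- matrix(rnorm(12), nrow = 4,
               dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
sub1 <- expr[1:2, c("s1", "s3")]

# Case 2: several assay types, each measured on its own set of samples.
# A plain named list stands in for a multi-assay container here.
meth <- matrix(runif(10), nrow = 5,
               dimnames = list(paste0("cpg", 1:5), c("s2", "s3")))
assays <- list(expression = expr, methylation = meth)

# Hypothetical helper: the samples shared by every assay in the list.
commonSamples <- function(assayList) {
  Reduce(intersect, lapply(assayList, colnames))
}

shared <- commonSamples(assays)

# Restrict every assay to the shared samples; the row sets still differ.
aligned <- lapply(assays, function(m) m[, shared, drop = FALSE])
```

In the first case a single [i, j] call does all the work.  In the second,
the sample sets have to be reconciled before any joint subsetting makes
sense, and that bookkeeping is what a multi-assay container would take on.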



> robert.
>
> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>
>> Oh, I don't disagree.  Perhaps the two problems can be addressed
>> simultaneously by
>>
>> 1) deciding on what contracts a multi-assay container can/would demand to
>> be useful
>> 2) calling it something besides SummarizedExperiment, say,
>> ExperimentCollection
>>
>> Then the SE API could stay the same as it is (which is already very
>> useful) and progress could be sought in the offshoot (ExperimentCollection
>> or whatever) without breaking things that rely on SE.
>>
>> Just off the top of my head, a most generically useful container for DNA
>> methylation & CNV data (which can of course be called from the same assay)
>> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
>> eSet backwards compatibility.  (e.g. sampleNames(x) works, but
>> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
>> rowData(x))  There are little niggles that I should probably just send in
>> a patch for, but a cleaner overall container would be better, if for no
>> other reason than the aforementioned ability to easily experiment with
>> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
>> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
>> somewhat less than optimal, especially when subsetting.
>>
>> But it does suggest that I could define a coercion from the current
>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>> whatever) and I'll bet other package authors could, too.  The presence of
>> a GRangesFrame would then be handy for returning a given assay's results,
>> so that the user could be blissfully ignorant of the storage backing (ff,
>> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
>> advantages of a SummarizedExperiment.
>>
>> JMHO
>>
>> Statistics is the grammar of science.
>> Karl Pearson<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>>
>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey<stvjc at channing.harvard.edu>
>> wrote:
>>
>>> I am a bit concerned about any major alterations to the
>>> SummarizedExperiment API.  We have two papers and plenty of working code
>>> that use it in meaningful ways.  Effort required to keep new formulations
>>> back-compatible as well as bug-free has to be weighed seriously.
>>>
>>> I agree that the name is not ideal.  We are learning as we go.
>>>
>>> Seems to make sense to start with the contracts we want the instances of
>>> a class to satisfy.  I have long felt that X[i, j] idiom is one users and
>>> developers should be comfortable with, even insist on, and for
>>> consistency with matrix operations idiom, it should work in a natural way
>>> for numeric indexing.  This seems like an important constraint.
>>> subsetBy* is a useful idiom, but it is conceivable that we would adopt
>>> filter() for row-oriented selections and select() for column-oriented
>>> selections.  Do we have to make any special design considerations to
>>> allow very smooth interoperation with out-of-memory resources for certain
>>> components for developers who want to allow this?
>>>
>>> We should have a reasonable way to get data on what is out there, what
>>> is used, how it is most effectively used.  What's the SE API?  Is it
>>> well-adapted to requirements of DESeq2?  Other killer packages that
>>> use/don't use it?  Even getting data on the formal API for a class is not
>>> all that familiar.  And if folks are writing non-S4 interfaces (i.e.,
>>> naked functions) we have no way of identifying them.  See below for one
>>> way of discovering the API for SummarizedExperiment.
>>>
>>> In summary, I think we have to be careful about overdesigning too
>>> early.  Getting clear on contracts seems the best way to ensure reuse,
>>> and we really want that so that reliability is continually assessed.  My
>>> sense is that it is good to give developers something they'll gladly
>>> extend, not necessarily reuse directly.  So we don't have to have broad
>>> consensus on class details, but on the minimal abstraction and on
>>> obligatory tests on its basic implementation.
>>>
>>> > methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM
>>> DataFrame with 76 rows and 3 columns
>>>          generic  signature                                                                 package
>>>      <character>  <character>                                                               <character>
>>> 1              [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
>>> 2              [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
>>> 3              [  x="SummarizedExperiment", i="ANY", j="missing"                            base
>>> 4            [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
>>> 5          assay  x="SummarizedExperiment", i="character"                                   GenomicRanges
>>> ...          ...  ...                                                                       ...
>>> 72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
>>> 73        values  x="SummarizedExperiment"                                                  S4Vectors
>>> 74      values<-  x="SummarizedExperiment"                                                  S4Vectors
>>> 75         width  x="SummarizedExperiment"                                                  BiocGenerics
>>> 76       width<-  x="SummarizedExperiment"                                                  BiocGenerics
>>>
>>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo<hcorrada at gmail.com>
>>> wrote:
>>>
>>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices'
>>>> can return whatever makes sense (GRanges, or other data structures
>>>> -thinking taxonomy for metagenomics for example-). GRangesFrame can
>>>> inherit from this.
>>>>
>>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès<hpages at fredhutch.org>
>>>> wrote:
>>>>
>>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>>
>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>
>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>
>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>                   some accessor (e.g. rowRanges())
>>>>>
>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>> can hold, but different in terms of API: the former has the ranges
>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>> API on its rowRanges() component. Nice switch!
>>>>>
>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>> an object that fully behaves like a DataFrame and people can also
>>>>> perform range-based operations on its rowRanges() component.
>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>> to be able to perform range-based operations *directly* on
>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>> would contain row indices. Problem with this is that these objects
>>>>> now start to suffer from the "dual personality syndrome". For
>>>>> example, it's not clear anymore what their length should be.
>>>>> Strictly speaking it should be their number of columns (that's
>>>>> what the length of a DataFrame is), but the ranges API that
>>>>> we're trying to put on them also makes them feel like vectors
>>>>> along the vertical dimension so it also feels that their length
>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>> Why does it subset the columns and not the rows? Most people
>>>>> are now confused.
>>>>>
>>>>> It's interesting to note that the same thing happens with GRanges
>>>>> objects, but in the opposite direction: people wish they could
>>>>> do DataFrame operations directly on them without calling mcols()
>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>> the pressure was just too strong).
>>>>>
>>>>> H.
>>>>>
>>>>>
>>>>>
>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>
>>>>>> Should be possible for the annotations to be of any type, as long as
>>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
>>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
>>>>>> to have a special class for the container with range information. The
>>>>>> contract for the range annotation would be to have a granges() method.
>>>>>>
>>>>>> I agree it would be nice if there was a way with the methods package
>>>>>> to easily assert such contracts. For example, one could define an
>>>>>> interface with a set of generics (and optionally the relevant position
>>>>>> in the generic signature). Then, once all of the methods have been
>>>>>> assigned for a particular class, it is made to inherit from that
>>>>>> contract class. There are lots of gotchas though. Not sure how useful
>>>>>> it would be in practice.
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty<haverty.peter at gene.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are some nice similarities in these new imaginary types.  A
>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns)
>>>>>>> and some row meta-data (the GRanges).  The SE-like object is similarly
>>>>>>> a list of dimensionally like things (matrices, RleDataFrames,
>>>>>>> BigMatrix objects, HDF5-backed things) with some row meta-data (a
>>>>>>> DataFrame or GRangesFrame).  Elegant?  Maybe they would actually be
>>>>>>> relatives in the class tree.
>>>>>>>
>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>>>> implements this set of methods ...
>>>>>>>
>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
>>>>>>> become an extension to this new SE-like thing.  The extra stuff that
>>>>>>> comes along with genoset will still be available.
>>>>>>>
>>>>>>> Pete
>>>>>>>
>>>>>>> ____________________
>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>> Genentech, Inc.
>>>>>>> phaverty at gene.com
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr.<tim.triche at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This.
>>>>>>>>
>>>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>>>> out of an object that held several such assays at several time points
>>>>>>>> in a population, where there are both assay-wise and covariate-wise
>>>>>>>> "holes" that could nonetheless be usefully imputed across assays.
>>>>>>>>
>>>>>>>>
>>>>>>>> Statistics is the grammar of science.
>>>>>>>> Karl Pearson<http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty<haverty.peter at gene.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>>> argument.
>>>>>>>>>>
>>>>>>>>>> Just impossible. As Michael mentioned back in November, they
>>>>>>>>>> have conflicting APIs.
>>>>>>>>>
>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>> (without mcols) as an index?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fredhutch.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --
> Robert Castelo, PhD
> Associate Professor
> Dept. of Experimental and Health Sciences
> Universitat Pompeu Fabra (UPF)
> Barcelona Biomedical Research Park (PRBB)
> Dr Aiguader 88
> E-08003 Barcelona, Spain
> telf: +34.933.160.514
> fax: +34.933.160.550
>
