[Bioc-devel] Changes to the SummarizedExperiment Class
Tim Triche, Jr.
tim.triche at gmail.com
Wed Mar 4 18:32:37 CET 2015
My response was meant to address this:
1) fixed-dimension, fixed sample set is a solved problem, and SE is that
solution.
2) multi-assay, "holes" across samples remains an ugly thorny problem,
maybe needs a new API
So why not keep SE as stable as possible, and dump all the explosive
changes into the latter?
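
To make the split concrete, here is a rough, language-neutral sketch in Python (not R, and not Bioconductor code). Every name in it is hypothetical: FixedExperiment stands in for an SE-like container with one fixed sample set, and ExperimentCollection borrows the placeholder name floated earlier in this thread for a multi-assay container whose assays may cover different samples. It only illustrates the two contracts under those assumptions; it is not a proposed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FixedExperiment:
    """Hypothetical SE-like container: one feature set, one sample set,
    and every assay is a features-by-samples table (list of rows)."""
    features: list
    samples: list
    assays: dict = field(default_factory=dict)  # assay name -> list of rows

    def add_assay(self, name, matrix):
        # Contract: every assay must match the fixed dimensions.
        if len(matrix) != len(self.features) or any(
            len(row) != len(self.samples) for row in matrix
        ):
            raise ValueError(f"assay {name!r} does not match fixed dimensions")
        self.assays[name] = matrix

    def subset(self, rows, cols):
        """X[i, j]-style numeric subsetting, applied to all assays at once."""
        out = FixedExperiment(
            [self.features[i] for i in rows], [self.samples[j] for j in cols]
        )
        for name, m in self.assays.items():
            out.assays[name] = [[m[i][j] for j in cols] for i in rows]
        return out


@dataclass
class ExperimentCollection:
    """Hypothetical multi-assay container: each experiment keeps its own
    sample set, so some samples are missing ("holes") from some assays."""
    experiments: dict = field(default_factory=dict)  # name -> FixedExperiment

    def all_samples(self):
        # Union of sample names, in first-seen order.
        seen = []
        for exp in self.experiments.values():
            for s in exp.samples:
                if s not in seen:
                    seen.append(s)
        return seen

    def holes(self):
        """Map each experiment name to the samples it lacks."""
        everyone = self.all_samples()
        return {
            name: [s for s in everyone if s not in exp.samples]
            for name, exp in self.experiments.items()
        }


# Toy data, made up for illustration only.
meth = FixedExperiment(["cg1", "cg2"], ["s1", "s2", "s3"])
meth.add_assay("beta", [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]])
rna = FixedExperiment(["geneA"], ["s1", "s3"])
rna.add_assay("counts", [[10, 30]])

coll = ExperimentCollection({"methylation": meth, "rna": rna})
print(coll.holes())
```

The fixed container can promise that one [i, j] subset stays consistent across all of its assays; the collection cannot, which is why it seems to want its own API rather than a change to SE.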
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:
>
>
> On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo <robert.castelo at upf.edu>
> wrote:
>
>> some of the goals behind this discussion are IMO similar to the ones for
>> biocMultiAssay:
>>
>> https://github.com/vjcitn/biocMultiAssay
>>
>> maybe Vince can confirm.
>>
>
>
> It is true that there are connections between the concerns. But the way I
> see it, the container design we are talking about in this thread addresses
> the management of a fixed common assay type over a fixed set of samples.
>
> The biocMultiAssay deals with the management of multiple assay types over
> multiple samples, with possible disparities in sample sets over the
> different assay types.
>
>
>
>> robert.
>>
>> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>>
>>> Oh, I don't disagree. Perhaps the two problems can be addressed
>>> simultaneously by
>>>
>>> 1) deciding on what contracts a multi-assay container can/would demand to
>>> be useful
>>> 2) calling it something besides SummarizedExperiment, say,
>>> ExperimentCollection
>>>
>>> Then the SE API could stay the same as it is (which is already very
>>> useful) and progress could be sought in the offshoot (ExperimentCollection
>>> or whatever) without breaking things that rely on SE.
>>>
>>> Just off the top of my head, a most generically useful container for DNA
>>> methylation & CNV data (which can of course be called from the same assay)
>>> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
>>> eSet backwards compatibility (e.g. sampleNames(x) works, but
>>> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
>>> rowData(x)). There are little niggles that I should probably just send in a
>>> patch for, but a cleaner overall container would be better, if for no other
>>> reason than the aforementioned ability to easily experiment with
>>> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
>>> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is...
>>> somewhat less than optimal, especially when subsetting.
>>>
>>> But it does suggest that I could define a coercion from the current
>>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>>> whatever) and I'll bet other package authors could, too. The presence of a
>>> GRangesFrame would then be handy for returning a given assay's results, so
>>> that the user could be blissfully ignorant of the storage backing (ff,
>>> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
>>> advantages of a SummarizedExperiment.
>>>
>>> JMHO
>>>
>>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey <stvjc at channing.harvard.edu>
>>> wrote:
>>>
>>>> I am a bit concerned about any major alterations to the
>>>> SummarizedExperiment API. We have two papers and plenty of working code
>>>> that use it in meaningful ways. Effort required to keep new formulations
>>>> back-compatible as well as bug-free has to be weighed seriously.
>>>>
>>>> I agree that the name is not ideal. We are learning as we go.
>>>>
>>>> Seems to make sense to start with the contracts we want the instances of
>>>> a class to satisfy. I have long felt that the X[i, j] idiom is one users
>>>> and developers should be comfortable with, even insist on, and for
>>>> consistency with the matrix operations idiom, it should work in a natural
>>>> way for numeric indexing. This seems like an important constraint.
>>>> subsetBy* is a useful idiom, but it is conceivable that we would adopt
>>>> filter() for row-oriented selections and select() for column-oriented
>>>> selections. Do we have to make any special design considerations to allow
>>>> very smooth interoperation with out-of-memory resources for certain
>>>> components for developers who want to allow this?
>>>>
>>>> We should have a reasonable way to get data on what is out there, what is
>>>> used, how it is most effectively used. What's the SE API? Is it
>>>> well-adapted to requirements of DESeq2? Other killer packages that
>>>> use/don't use it? Even getting data on the formal API for a class is not
>>>> all that familiar. And if folks are writing non-S4 interfaces (i.e., naked
>>>> functions) we have no way of identifying them. See below for one way of
>>>> discovering the API for SummarizedExperiment.
>>>>
>>>> In summary, I think we have to be careful about overdesigning too early.
>>>> Getting clear on contracts seems the best way to ensure reuse, and we
>>>> really want that so that reliability is continually assessed. My sense is
>>>> that it is good to give developers something they'll gladly extend, not
>>>> necessarily reuse directly. So we don't have to have broad consensus on
>>>> class details, but on the minimal abstraction and on obligatory tests on
>>>> its basic implementation.
>>>>
>>>> methods(class="SummarizedExperiment") # perhaps an obsolete version of methods cataloguer by MTM
>>>>
>>>> DataFrame with 76 rows and 3 columns
>>>>     generic       signature                                                                 package
>>>>     <character>   <character>                                                               <character>
>>>> 1   [             x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
>>>> 2   [             x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
>>>> 3   [             x="SummarizedExperiment", i="ANY", j="missing"                            base
>>>> 4   [<-           x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
>>>> 5   assay         x="SummarizedExperiment", i="character"                                   GenomicRanges
>>>> ... ...           ...                                                                       ...
>>>> 72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
>>>> 73  values        x="SummarizedExperiment"                                                  S4Vectors
>>>> 74  values<-      x="SummarizedExperiment"                                                  S4Vectors
>>>> 75  width         x="SummarizedExperiment"                                                  BiocGenerics
>>>> 76  width<-       x="SummarizedExperiment"                                                  BiocGenerics
>>>>
>>>>
>>>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorrada at gmail.com>
>>>> wrote:
>>>>
>>>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
>>>>> return whatever makes sense (GRanges, or other data structures; thinking
>>>>> taxonomy for metagenomics, for example). GRangesFrame can inherit from
>>>>> this.
>>>>>
>>>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpages at fredhutch.org>
>>>>> wrote:
>>>>>
>>>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>>>
>>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>>
>>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>>
>>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>> some accessor (e.g. rowRanges())
>>>>>>
>>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>>> can hold, but different in terms of API: the former has the ranges
>>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>>> API on its rowRanges() component. Nice switch!
>>>>>>
>>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>>> an object that fully behaves like a DataFrame and people can also
>>>>>> perform range-based operations on its rowRanges() component.
>>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>>> to be able to perform range-based operations *directly* on
>>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>>> would contain row indices. Problem with this is that these objects
>>>>>> now start to suffer from the "dual personality syndrome". For
>>>>>> example, it's not clear anymore what their length should be.
>>>>>> Strictly speaking it should be their number of columns (that's
>>>>>> what the length of a DataFrame is), but the ranges API that
>>>>>> we're trying to put on them also makes them feel like vectors
>>>>>> along the vertical dimension so it also feels that their length
>>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>>> Why does it subset the columns and not the rows? Most people
>>>>>> are now confused.
>>>>>>
>>>>>> It's interesting to note that the same thing happens with GRanges
>>>>>> objects, but in the opposite direction: people wish they could
>>>>>> do DataFrame operations directly on them without calling mcols()
>>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>>> the pressure was just too strong).
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>> Should be possible for the annotations to be of any type, as long as
>>>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
>>>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
>>>>>>> to have a special class for the container with range information. The
>>>>>>> contract for the range annotation would be to have a granges() method.
>>>>>>>
>>>>>>> I agree it would be nice if there was a way with the methods package to
>>>>>>> easily assert such contracts. For example, one could define an interface
>>>>>>> with a set of generics (and optionally the relevant position in the
>>>>>>> generic signature). Then, once all of the methods have been assigned for
>>>>>>> a particular class, it is made to inherit from that contract class.
>>>>>>> There are lots of gotchas though. Not sure how useful it would be in
>>>>>>> practice.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There are some nice similarities in these new imaginary types. A
>>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>>>>>> some row meta-data (the GRanges). The SE-like object is similarly a list
>>>>>>>> of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
>>>>>>>> HDF5-backed things) with some row meta-data (a DataFrame or
>>>>>>>> GRangesFrame). Elegant? Maybe they would actually be relatives in the
>>>>>>>> class tree.
>>>>>>>>
>>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>>> Interfaces or duck-typing. The "x" slot of "y" holds something that
>>>>>>>> implements this set of methods ...
>>>>>>>>
>>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>>>>>> an extension to this new SE-like thing. The extra stuff that comes along
>>>>>>>> with genoset will still be available.
>>>>>>>>
>>>>>>>> Pete
>>>>>>>>
>>>>>>>> ____________________
>>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>>> Genentech, Inc.
>>>>>>>> phaverty at gene.com
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This.
>>>>>>>>>
>>>>>>>>> It would be damned near perfect as a return value for assays coming out
>>>>>>>>> of an object that held several such assays at several time points in a
>>>>>>>>> population, where there are both assay-wise and covariate-wise "holes"
>>>>>>>>> that could nonetheless be usefully imputed across assays.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>>>> argument.
>>>>>>>>>>>
>>>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>>>> conflicting APIs.
>>>>>>>>>>
>>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>>> (without mcols) as an index?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fredhutch.org
>>>>>> Phone: (206) 667-5791
>>>>>> Fax: (206) 667-1319
>>>>>>
>>
>> --
>> Robert Castelo, PhD
>> Associate Professor
>> Dept. of Experimental and Health Sciences
>> Universitat Pompeu Fabra (UPF)
>> Barcelona Biomedical Research Park (PRBB)
>> Dr Aiguader 88
>> E-08003 Barcelona, Spain
>> telf: +34.933.160.514
>> fax: +34.933.160.550
>>
>
>