[Bioc-devel] Changes to the SummarizedExperiment Class

Peter Haverty haverty.peter at gene.com
Wed Mar 4 19:03:32 CET 2015


Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.
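
As a rough illustration of the "rectangular things implement the matrix
interface" idea, here is a minimal sketch in base R. The helper name
is_rectangular() is hypothetical (it is not an existing Bioconductor or
methods-package function); it simply checks by behaviour, in the spirit of
the NROW() and 2D "[" contract discussed further down the thread.

```r
# Hypothetical helper (an assumption for illustration, not an existing API):
# an object counts as "rectangular" if it has two dimensions and supports
# 2D subsetting with "[".
is_rectangular <- function(x) {
  d <- dim(x)
  if (is.null(d) || length(d) != 2L) {
    return(FALSE)
  }
  # Try a zero-row 2D subset; any error means the contract is not met.
  ok <- tryCatch({
    y <- x[integer(0), , drop = FALSE]
    NROW(y) == 0L
  }, error = function(e) FALSE)
  isTRUE(ok)
}

is_rectangular(matrix(1:6, nrow = 2))   # a matrix satisfies the contract
is_rectangular(data.frame(a = 1:3))     # so does a data.frame
is_rectangular(1:3)                     # a plain vector does not
```

A check like this tests what an object can do rather than what it inherits
from, which is one way to read the "interface" suggestion without adding
new classes.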

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.
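
For readers who have not tried it, a small example of the call in question,
using a base R class so it runs without any Bioconductor packages. In
R 3.2.0 and later, methods() reports S4 methods as well as S3 ones; the
choice of "data.frame" here is only for illustration.

```r
library(methods)

# List the methods registered for a class. The same call works for an
# S4 class name, e.g. one defined by a Bioconductor package, once that
# package is loaded.
m <- methods(class = "data.frame")
print(m)
```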

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <lawrence.michael at gene.com>
wrote:

> I think we need to make sure that there are enough benefits of something
> like GRangesFrame before we introduce yet another complicated and
> overlapping data structure into the framework. Prior to summarization, the
> ranges seem primary; after summarization, it may often make sense for them
> to be secondary. But I'm just not sure what we gain from a new data
> structure.
>
> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>
>> GRangesFrame is an interesting idea and I gave it some thought.
>>
>> There is this nice symmetry between GRanges and GRangesFrame:
>>
>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>
>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>                  some accessor (e.g. rowRanges())
>>
>> So GRanges and GRangesFrame are equivalent in terms of what they
>> can hold, but different in terms of API: the former has the ranges
>> API as primary API and the DataFrame API on its mcols() component,
>> and the latter has the DataFrame API as primary API and the ranges
>> API on its rowRanges() component. Nice switch!
>>
>> What does this API switch bring us? A GRangesFrame object is now
>> an object that fully behaves like a DataFrame and people can also
>> perform range-based operations on its rowRanges() component.
>> Here is what I'm afraid is going to happen: people will also want
>> to be able to perform range-based operations *directly* on
>> these objects, i.e. without having to call rowRanges() first.
>> So for example when they do subsetByOverlaps(), subsetting
>> happens vertically. Also the Hits object returned by findOverlaps()
>> would contain row indices. Problem with this is that these objects
>> now start to suffer from the "dual personality syndrome". For
>> example, it's not clear anymore what their length should be.
>> Strictly speaking it should be their number of columns (that's
>> what the length of a DataFrame is), but the ranges API that
>> we're trying to put on them also makes them feel like vectors
>> along the vertical dimension so it also feels that their length
>> should be their number of rows. Same thing with 1D subsetting.
>> Why does it subset the columns and not the rows? Most people
>> are now confused.
>>
>> It's interesting to note that the same thing happens with GRanges
>> objects, but in the opposite direction: people wish they could
>> do DataFrame operations directly on them without calling mcols()
>> first. But in order to preserve the good health of GRanges objects,
>> we've not done that (except for $, which is a shortcut for mcols(x)$;
>> the pressure was just too strong).
>>
>> H.
>>
>>
>>
>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>
>>> Should be possible for the annotations to be of any type, as long as they
>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>> special class for the container with range information. The contract for
>>> the range annotation would be to have a granges() method.
>>>
>>> I agree it would be nice if there was a way with the methods package to
>>> easily assert such contracts. For example, one could define an interface
>>> with a set of generics (and optionally the relevant position in the
>>> generic signature). Then, once all of the methods have been assigned for
>>> a particular class, it is made to inherit from that contract class.
>>> There are lots of gotchas though. Not sure how useful it would be in
>>> practice.
>>>
>>>
>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
>>> wrote:
>>>
>>>> There are some nice similarities in these new imaginary types. A
>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>> some row meta-data (the GRanges). The SE-like object is similarly a
>>>> list of dimensionally like things (matrices, RleDataFrames, BigMatrix
>>>> objects, HDF5-backed things) with some row meta-data (a DataFrame or
>>>> GRangesFrame). Elegant? Maybe they would actually be relatives in the
>>>> class tree.
>>>>
>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>> implements this set of methods ...
>>>>
>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>> an extension to this new SE-like thing. The extra stuff that comes
>>>> along with genoset will still be available.
>>>>
>>>> Pete
>>>>
>>>> ____________________
>>>> Peter M. Haverty, Ph.D.
>>>> Genentech, Inc.
>>>> phaverty at gene.com
>>>>
>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>> wrote:
>>>>
>>>>> This.
>>>>>
>>>>> It would be damned near perfect as a return value for assays coming
>>>>> out of an object that held several such assays at several time points
>>>>> in a population, where there are both assay-wise and covariate-wise
>>>>> "holes" that could nonetheless be usefully imputed across assays.
>>>>>
>>>>>
>>>>> Statistics is the grammar of science.
>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>
>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   I still think GRanges should be a subclass of DataFrame,
>>>>>>>
>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>
>>>>>>> argument.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>> conflicting APIs.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>> (without mcols) as an index?
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
>
