[Bioc-devel] Changes to the SummarizedExperiment Class

Martin Morgan mtmorgan at fredhutch.org
Wed Mar 4 21:03:16 CET 2015


On 03/04/2015 10:03 AM, Peter Haverty wrote:
> Michael has a good point. The complexity of the BioC universe of classes
> hurts our ability to attract new users. More classes would be a minus there
> ... but a small set of common, explicit APIs would simplify things.
> Rectangular things implement the matrix Interface.  :-) Deprecating old
> stuff, like eSet, might help more than it hurts, on the simplicity front.
>
> P.S. apropos of understanding this universe of classes, I *love* the
> methods(class=x) thing Vincent mentioned.

The current version, under R-devel, is at

   devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

   > methods(class="SummarizedExperiment")
    [1] [                 [[                [[<-              [<-
    [5] $                 $<-               assay             assay<-
    [9] assayNames        assayNames<-      assays            assays<-
   [13] cbind             coerce            colData           colData<-
   [17] compare           Compare           countOverlaps     coverage
   [21] dim               dimnames          dimnames<-        disjointBins
   [25] distance          distanceToNearest duplicated        elementMetadata
   [29] elementMetadata<- end               end<-             exptData
   [33] exptData<-        extractROWS       findOverlaps      flank
   [37] follow            granges           isDisjoint        mcols
   [41] mcols<-           narrow            nearest           order
   [45] overlapsAny       precede           ranges            ranges<-
   [49] rank              rbind             replaceROWS       resize
   [53] restrict          rowData           rowData<-         seqinfo
   [57] seqinfo<-         seqnames          shift             show
   [61] sort              split             start             start<-
   [65] strand            strand<-          subset            subsetByOverlaps
   [69] updateObject      values            values<-          width
   [73] width<-

   see ?"methods" for accessing help and source code

and

 > head(attr(methods(class="SummarizedExperiment"), "info"))
                                                              generic visible
[,SummarizedExperiment,ANY-method                                  [    TRUE
[[,SummarizedExperiment,ANY,missing-method                        [[    TRUE
[[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE
$,SummarizedExperiment-method                                      $    TRUE
$<-,SummarizedExperiment-method                                  $<-    TRUE
                                                              isS4          from
[,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
$,SummarizedExperiment-method                                TRUE GenomicRanges
$<-,SummarizedExperiment-method                              TRUE GenomicRanges
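
For anyone without GenomicRanges at hand, here is a minimal sketch of the same idea on a toy S4 class. The class "Toy" and the generic area() are made up for illustration, and the sketch assumes R >= 3.2.0, where methods(class=) reports S4 methods and attaches the "info" data frame shown above.

```r
library(methods)

# Hypothetical toy class and generic, for illustration only.
setClass("Toy", representation(side = "numeric"))
setGeneric("area", function(x) standardGeneric("area"))
setMethod("area", "Toy", function(x) x@side^2)
setMethod("show", "Toy", function(object) cat("<Toy side =", object@side, ">\n"))

# List every method whose signature mentions "Toy".
m <- methods(class = "Toy")
print(m)

# The "info" attribute is a data frame with one row per method; its
# columns include the generic name, whether the method is S4, and
# where it was found.
info <- attr(m, "info")
print(head(info))
```

The row names of the info data frame are the full method names (for example "area,Toy-method"), which is what makes the table useful for tracing a method back to its defining package.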

Martin

>
> Pete
>
> ____________________
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phaverty at gene.com
>
> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <lawrence.michael at gene.com>
> wrote:
>
>> I think we need to make sure that there are enough benefits of something
>> like GRangesFrame before we introduce yet another complicated and
>> overlapping data structure into the framework. Prior to summarization, the
>> ranges seem primary, after summarization, it may often make sense for them
>> to be secondary. But I'm just not sure what we gain from a new data
>> structure.
>>
>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>
>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>
>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>
>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>                   some accessor (e.g. rowRanges())
>>>
>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>> can hold, but different in terms of API: the former has the ranges
>>> API as primary API and the DataFrame API on its mcols() component,
>>> and the latter has the DataFrame API as primary API and the ranges
>>> API on its rowRanges() component. Nice switch!
>>>
>>> What does this API switch bring us? A GRangesFrame object is now
>>> an object that fully behaves like a DataFrame and people can also
>>> perform range-based operations on its rowRanges() component.
>>> Here is what I'm afraid is going to happen: people will also want
>>> to be able to perform range-based operations *directly* on
>>> these objects, i.e. without having to call rowRanges() first.
>>> So for example when they do subsetByOverlaps(), subsetting
>>> happens vertically. Also the Hits object returned by findOverlaps()
>>> would contain row indices. Problem with this is that these objects
>>> now start to suffer from the "dual personality syndrome". For
>>> example, it's not clear anymore what their length should be.
>>> Strictly speaking it should be their number of columns (that's
>>> what the length of a DataFrame is), but the ranges API that
>>> we're trying to put on them also makes them feel like vectors
>>> along the vertical dimension so it also feels that their length
>>> should be their number of rows. Same thing with 1D subsetting.
>>> Why does it subset the columns and not the rows? Most people
>>> are now confused.
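
The length and 1D-subsetting ambiguity described above can be seen with a plain base R data.frame, used here only as a stand-in for DataFrame. This is a small sketch, not GRangesFrame code; the column names are arbitrary.

```r
# Four rows (think: four ranges), three columns.
df <- data.frame(start = c(1L, 10L, 20L, 40L),
                 end   = c(5L, 15L, 30L, 45L),
                 score = c(0.1, 0.5, 0.9, 0.3))

length(df)      # counts columns: 3
nrow(df)        # counts rows: 4

# 1D subsetting picks columns, not rows.
names(df[1])    # "start"
nrow(df[1])     # still 4 rows

# Row subsetting needs the 2D form.
nrow(df[1, ])   # 1
```

A ranges-style object would be expected to answer 4 for length() and to return the first range for x[1], which is exactly the conflict between the two APIs.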
>>>
>>> It's interesting to note that the same thing happens with GRanges
>>> objects, but in the opposite direction: people wish they could
>>> do DataFrame operations directly on them without calling mcols()
>>> first. But in order to preserve the good health of GRanges objects,
>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>> the pressure was just too strong).
>>>
>>> H.
>>>
>>>
>>>
>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>
>>>> Should be possible for the annotations to be of any type, as long as they
>>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>>> special class for the container with range information. The contract for
>>>> the range annotation would be to have a granges() method.
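
A rough sketch of checking the NROW() and 2D "[" contract by duck typing, in base R only. The helper name satisfiesRowContract() is hypothetical, and a real check would probably want to verify more than this (for example that column subsetting also works).

```r
# Returns TRUE when `x` reports a row count and supports 2D row
# subsetting. A zero-row subset is requested so the check also works
# on empty objects.
satisfiesRowContract <- function(x) {
    n <- tryCatch(NROW(x), error = function(e) NA_integer_)
    if (is.na(n)) return(FALSE)
    sub <- tryCatch(x[integer(0), , drop = FALSE], error = function(e) NULL)
    !is.null(sub) && NROW(sub) == 0L
}

satisfiesRowContract(data.frame(a = 1:3, b = letters[1:3]))
satisfiesRowContract(matrix(1:6, nrow = 3))

# An atomic vector has an NROW() but rejects 2D subsetting.
satisfiesRowContract(1:3)
```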
>>>>
>>>> I agree it would be nice if there was a way with the methods package to
>>>> easily assert such contracts. For example, one could define an interface
>>>> with a set of generics (and optionally the relevant position in the
>>>> generic signature). Then, once all of the methods have been assigned for
>>>> a particular class, it is made to inherit from that contract class. There
>>>> are lots of gotchas though. Not sure how useful it would be in practice.
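
One way the interface idea above might be approximated with the methods package as it stands: check which classes have methods for every generic in the contract, then collect the conforming classes in a class union. Everything named here (rangesOf(), nRows(), "WithRanges", "NoRanges", "RangeAnnotation", implementsAll()) is hypothetical, and the sketch ignores the signature-position question and the gotchas mentioned above.

```r
library(methods)

# Hypothetical generics standing in for the interface.
setGeneric("rangesOf", function(x) standardGeneric("rangesOf"))
setGeneric("nRows", function(x) standardGeneric("nRows"))

# Two hypothetical candidate classes; only the first has ranges.
setClass("WithRanges", representation(start = "integer", end = "integer"))
setClass("NoRanges", representation(values = "numeric"))

setMethod("rangesOf", "WithRanges",
          function(x) cbind(start = x@start, end = x@end))
setMethod("nRows", "WithRanges", function(x) length(x@start))
setMethod("nRows", "NoRanges", function(x) length(x@values))

# TRUE when `class` has a (possibly inherited) method for each generic.
implementsAll <- function(class, generics) {
    all(vapply(generics, function(g) hasMethod(g, class), logical(1)))
}

contract <- c("rangesOf", "nRows")
candidates <- c("WithRanges", "NoRanges")
ok <- vapply(candidates, implementsAll, logical(1), generics = contract)
conforming <- candidates[ok]

# The "contract class": a union of the classes that passed the check.
setClassUnion("RangeAnnotation", members = conforming)

is(new("WithRanges", start = 1L, end = 5L), "RangeAnnotation")
is(new("NoRanges", values = 1), "RangeAnnotation")
```

A slot typed as "RangeAnnotation" would then accept any conforming class, though the check runs only once, so methods removed or added later are not tracked.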
>>>>
>>>>
>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
>>>> wrote:
>>>>
>>>>> There are some nice similarities in these new imaginary types.  A
>>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>>> some row meta-data (the GRanges).  The SE-like object is similarly a
>>>>> list of dimensionally like things (matrices, RleDataFrames, BigMatrix
>>>>> objects, HDF5-backed things) with some row meta-data (a DataFrame or
>>>>> GRangesFrame). Elegant?  Maybe they would actually be relatives in the
>>>>> class tree.
>>>>>
>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>> implements this set of methods ...
>>>>>
>>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>>> an extension to this new SE-like thing.  The extra stuff that comes
>>>>> along with genoset will still be available.
>>>>>
>>>>> Pete
>>>>>
>>>>> ____________________
>>>>> Peter M. Haverty, Ph.D.
>>>>> Genentech, Inc.
>>>>> phaverty at gene.com
>>>>>
>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This.
>>>>>>
>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>> out of an object that held several such assays at several time points
>>>>>> in a population, where there are both assay-wise and covariate-wise
>>>>>> "holes" that could nonetheless be usefully imputed across assays.
>>>>>>
>>>>>>
>>>>>> Statistics is the grammar of science.
>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>> argument.
>>>>>>>>
>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>> conflicting APIs.
>>>>>>>
>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>> (without mcols) as an index?
>>>>>>>
>>>>>>>
>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fredhutch.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>
>>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


