[Bioc-devel] Changes to the SummarizedExperiment Class

Mon Mar 9 17:31:22 CET 2015

On 03/09/2015 07:06 AM, Kasper Daniel Hansen wrote:
> It sounds like the proposed changes are already made.  However (like
> others) I am still a bit mystified why this was necessary.  The old version
> did allow for a GRanges inside the DataFrame of the rowData, as far as I
> recall.  So I assume this is for efficiency.  But why?  What kind of
> data/use cases is this for?
>
> I am happy to hear that SummarizedExperiment is going to be spun out into
> its own package.  When that happens, I have some comments, which I'll
> include here in anticipation:
>    1) I now very strongly believe it was a design mistake to not have
> colnames on the assays.  The advantage of this choice is that sampleNames
> are stored in only one place.  The extreme disadvantage is the high
> inefficiency when you want colnames on an extracted assay.
>    2) I still strongly believe we should support pData, sampleNames, etc.
> on SummarizedExperiments.

I'm not keen on this 'backward compatibility' layer, or introducing functions 
with duplicate functionality, even if their implementation is just a 'one 
liner'; use rownames, colData, etc.
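
A rough sketch of what "use rownames, colData, etc." can look like in practice. It assumes a current SummarizedExperiment package is installed; the toy matrix, the sample names and the `treatment` column are made up for illustration and do not come from this thread.

```r
library(SummarizedExperiment)

# Toy data (hypothetical): 3 features x 2 samples
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

se <- SummarizedExperiment(
    assays  = list(counts = counts),
    colData = DataFrame(treatment = c("ctrl", "trt"),
                        row.names = c("s1", "s2"))
)

colnames(se)    # sample names, the role sampleNames() plays on an eSet
rownames(se)    # feature names, the role featureNames() plays on an eSet
colData(se)     # sample annotation, the role pData() plays on an eSet
se$treatment    # shortcut for colData(se)$treatment
```

The generic accessors already cover these cases, which is the argument against adding a second set of names for the same operations.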

>    3) Having developed a package (minfi) where eSets co-exist with
> SummarizedExperiment, I have to mention that for the developer there are a
> number of places where the different internals of these two classes make
> life irritating.  For this reason I would support a "modern" implementation
> of eSet, in parallel with SummarizedExperiment.

Yes, the intention is that a SummarizedExperiment (sub) class with rowData() 
being a DataFrame would be a replacement for eSet.
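
For illustration only, a minimal sketch of a range-free object whose row annotation is a plain DataFrame. It assumes the constructor of a current SummarizedExperiment package; the probe names, the `symbol` column and the assay name `exprs` are hypothetical.

```r
library(SummarizedExperiment)

# Toy expression matrix (hypothetical): 3 probes x 2 samples
exprs_mat <- matrix(c(1.2, 0.4, 3.1, 2.2, 0.9, 2.8), nrow = 3,
                    dimnames = list(c("probe1", "probe2", "probe3"),
                                    c("s1", "s2")))

# Feature annotation is a DataFrame; no genomic ranges are supplied
se_nr <- SummarizedExperiment(
    assays  = list(exprs = exprs_mat),
    rowData = DataFrame(symbol = c("A", "B", "C"))
)

rowData(se_nr)          # feature annotation, comparable to fData() on an eSet
assay(se_nr, "exprs")   # the data matrix, comparable to exprs() on an ExpressionSet
```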

I don't think you were suggesting that eSet itself should be modernized; it has 
a lot of historical baggage.

Martin

>
> Best,
> Kasper
>
> On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain <vobencha at fredhutch.org>
> wrote:
>
>> Hi Mike,
>>
>> Our error - we didn't bump GenomicRanges when rowRanges was added.
>> Hopefully 1.19.43 will propagate today and things will be sorted out.
>>
>> Val
>>
>>
>> On 03/06/2015 07:40 AM, Michael Love wrote:
>>
>>> hi all,
>>>
>>> just a practical issue: I have GenomicRanges version 1.19.42 on my
>>> computer which does not have rowRanges defined, although the 1.19.42
>>> version on the Bioc website does have rowRanges in the man page:
>>>
>>> http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html
>>>
>>> So I pass check locally but not in the devel branch on Bioc servers.
>>>
>>> > library(GenomicRanges)
>>> > rowRanges
>>> Error: object 'rowRanges' not found
>>>
>>> > sessionInfo()
>>> R Under development (unstable) (2014-12-08 r67137)
>>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
>>> [5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
>>> [9] BiocInstaller_1.17.5
>>>
>>>
>>>
>>> On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan <mtmorgan at fredhutch.org>
>>> wrote:
>>>
>>>>
>>>> On 03/04/2015 10:03 AM, Peter Haverty wrote:
>>>>
>>>>>
>>>>> Michael has a good point. The complexity of the BioC universe of classes
>>>>> hurts our ability to attract new users. More classes would be a minus
>>>>> there
>>>>> ... but a small set of common, explicit APIs would simplify things.
>>>>> Rectangular things implement the matrix Interface.  :-) Deprecating old
>>>>> stuff, like eSet, might help more than it hurts, on the simplicity
>>>>> front.
>>>>>
>>>>> P.S. apropos of understanding this universe of classes, I *love* the
>>>>> methods(class=x) thing Vincent mentioned.
>>>>>
>>>>
>>>>
>>>> The current version, under R-devel, is at
>>>>
>>>>     devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")
>>>>
>>>>     > methods(class="SummarizedExperiment")
>>>>      [1] [                 [[                [[<-              [<-
>>>>      [5] $                 $<-               assay             assay<-
>>>>      [9] assayNames        assayNames<-      assays            assays<-
>>>>     [13] cbind             coerce            colData           colData<-
>>>>     [17] compare           Compare           countOverlaps     coverage
>>>>     [21] dim               dimnames          dimnames<-        disjointBins
>>>>     [25] distance          distanceToNearest duplicated        elementMetadata
>>>>     [29] elementMetadata<- end               end<-             exptData
>>>>     [33] exptData<-        extractROWS       findOverlaps      flank
>>>>     [37] follow            granges           isDisjoint        mcols
>>>>     [41] mcols<-           narrow            nearest           order
>>>>     [45] overlapsAny       precede           ranges            ranges<-
>>>>     [49] rank              rbind             replaceROWS       resize
>>>>     [53] restrict          rowData           rowData<-         seqinfo
>>>>     [57] seqinfo<-         seqnames          shift             show
>>>>     [61] sort              split             start             start<-
>>>>     [65] strand            strand<-          subset            subsetByOverlaps
>>>>     [69] updateObject      values            values<-          width
>>>>     [73] width<-
>>>>
>>>>     see ?"methods" for accessing help and source code
>>>>
>>>> and
>>>>
>>>>     > head(attr(methods(class="SummarizedExperiment"), "info"))
>>>>                                                                  generic visible
>>>>     [,SummarizedExperiment,ANY-method                                  [    TRUE
>>>>     [[,SummarizedExperiment,ANY,missing-method                        [[    TRUE
>>>>     [[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE
>>>>     [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE
>>>>     $,SummarizedExperiment-method                                      $    TRUE
>>>>     $<-,SummarizedExperiment-method                                  $<-    TRUE
>>>>                                                                  isS4          from
>>>>     [,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
>>>>     [[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
>>>>     [[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
>>>>     [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
>>>>     $,SummarizedExperiment-method                                TRUE GenomicRanges
>>>>     $<-,SummarizedExperiment-method                              TRUE GenomicRanges
>>>>
>>>> Martin
>>>>
>>>>
>>>>> Pete
>>>>>
>>>>> ____________________
>>>>> Peter M. Haverty, Ph.D.
>>>>> Genentech, Inc.
>>>>> phaverty at gene.com
>>>>>
>>>>> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
>>>>> lawrence.michael at gene.com>
>>>>> wrote:
>>>>>
>>>>>> I think we need to make sure that there are enough benefits of
>>>>>> something like GRangesFrame before we introduce yet another
>>>>>> complicated and overlapping data structure into the framework.
>>>>>> Prior to summarization, the ranges seem primary, after
>>>>>> summarization, it may often make sense for them to be secondary.
>>>>>> But I'm just not sure what we gain from a new data structure.
>>>>>>
>>>>>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org>
>>>>>> wrote:
>>>>>>
>>>>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>>>>
>>>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>>>
>>>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>>>
>>>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>>>                     some accessor (e.g. rowRanges())
>>>>>>>
>>>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>>>> can hold, but different in terms of API: the former has the ranges
>>>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>>>> API on its rowRanges() component. Nice switch!
>>>>>>>
>>>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>>>> an object that fully behaves like a DataFrame and people can also
>>>>>>> perform range-based operations on its rowRanges() component.
>>>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>>>> to be able to perform range-based operations *directly* on
>>>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>>>> would contain row indices. Problem with this is that these objects
>>>>>>> now start to suffer from the "dual personality syndrome". For
>>>>>>> example, it's not clear anymore what their length should be.
>>>>>>> Strictly speaking it should be their number of columns (that's
>>>>>>> what the length of a DataFrame is), but the ranges API that
>>>>>>> we're trying to put on them also makes them feel like vectors
>>>>>>> along the vertical dimension so it also feels that their length
>>>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>>>> Why does it subset the columns and not the rows? Most people
>>>>>>> are now confused.
>>>>>>>
>>>>>>> It's interesting to note that the same thing happens with GRanges
>>>>>>> objects, but in the opposite direction: people wish they could
>>>>>>> do DataFrame operations directly on them without calling mcols()
>>>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>>>> the pressure was just too strong).
>>>>>>>
>>>>>>> H.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>>> Should be possible for the annotations to be of any type, as long
>>>>>>>> as they satisfy a simple contract of NROW() and 2D "[". Then, you
>>>>>>>> could have a DataFrame, GRanges, or whatever in there. But it would
>>>>>>>> be nice to have a special class for the container with range
>>>>>>>> information. The contract for the range annotation would be to
>>>>>>>> have a granges() method.
>>>>>>>>
>>>>>>>> I agree it would be nice if there was a way with the methods
>>>>>>>> package to easily assert such contracts. For example, one could
>>>>>>>> define an interface with a set of generics (and optionally the
>>>>>>>> relevant position in the generic signature). Then, once all of the
>>>>>>>> methods have been assigned for a particular class, it is made to
>>>>>>>> inherit from that contract class. There are lots of gotchas
>>>>>>>> though. Not sure how useful it would be in practice.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <
>>>>>>>> haverty.peter at gene.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> There are some nice similarities in these new imaginary types.  A
>>>>>>>>> "GRangesFrame" is a list of dimensionally identical things
>>>>>>>>> (columns) and some row meta-data (the GRanges).  The SE-like object
>>>>>>>>> is similarly a list of dimensionally like things (matrices,
>>>>>>>>> RleDataFrames, BigMatrix objects, HDF5-backed things) with some row
>>>>>>>>> meta-data (a DataFrame or GRangesFrame).  Elegant?  Maybe they would
>>>>>>>>> actually be relatives in the class tree.
>>>>>>>>>
>>>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>>>>>> implements this set of methods ...
>>>>>>>>>
>>>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
>>>>>>>>> become an extension to this new SE-like thing.  The extra stuff that
>>>>>>>>> comes along with genoset will still be available.
>>>>>>>>>
>>>>>>>>> Pete
>>>>>>>>>
>>>>>>>>> ____________________
>>>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>>>> Genentech, Inc.
>>>>>>>>> phaverty at gene.com
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <
>>>>>>>>> tim.triche at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This.
>>>>>>>>>>
>>>>>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>>>>>> out of an object that held several such assays at several time
>>>>>>>>>> points in a population, where there are both assay-wise and
>>>>>>>>>> covariate-wise "holes" that could nonetheless be usefully imputed
>>>>>>>>>> across assays.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Statistics is the grammar of science.
>>>>>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <
>>>>>>>>>> haverty.peter at gene.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>>>>> argument.
>>>>>>>>>>>>
>>>>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>>>>> conflicting APIs.
>>>>>>>>>>>
>>>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>>>> (without mcols) as an index?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>             [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>   --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fredhutch.org
>>>>>>> Phone:  (206) 667-5791
>>>>>>> Fax:    (206) 667-1319
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, Seattle, WA 98109
>>
>> Email: vobencha at fredhutch.org
>> Phone: (206) 667-3158
>>
>>
>>
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793