[Bioc-devel] Changes to the SummarizedExperiment Class

Vincent Carey stvjc at channing.harvard.edu
Mon Mar 9 15:30:14 CET 2015


I am glad you are keeping this discussion alive, Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

> It sounds like the proposed changes are already made.  However (like
> others) I am still a bit mystified why this was necessary.  The old version
> did allow for a GRanges inside the DataFrame of the rowData, as far as I
> recall.  So I assume this is for efficiency.  But why?  What kind of
> data/use cases is this for?
>
> I am happy to hear that SummarizedExperiment is going to be spun out into
> its own package.  When that happens, I have some comments, which I'll
> include here in anticipation
>   1) I now very strongly believe it was a design mistake to not have
> colnames on the assays.  The advantage of this choice is that sampleNames
> are only stored one place.  The extreme disadvantage is the high
> inefficiency when you want colnames on an extracted assay.
>

after example(SummarizedExperiment)

> colnames(assays(se1)[[1]])
[1] "A" "B" "C" "D" "E" "F"

so this seems to be optional.  But attempts to set rownames will fail
silently

> rownames(assays(se1)[[1]]) = as.character(1:200)
> rownames(assays(se1)[[1]])
NULL

seems we could issue a warning there
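
A minimal sketch of what such a warning could look like, using a toy base-R container rather than the actual SummarizedExperiment code. The names `newToy`, `getAssay` and `setAssay<-` are hypothetical, and the assumption that row names live only on the container is made purely for illustration:

```r
# Toy stand-in for a container that keeps row names in one place only.
# Hypothetical names throughout; this is NOT the GenomicRanges implementation.
newToy <- function(m, rn = NULL) list(assay = unname(m), rn = rn)

# Row names are re-attached from the container when an assay is extracted.
getAssay <- function(x) {
  m <- x$assay
  rownames(m) <- x$rn
  m
}

# The setter discards row names on the incoming assay, but says so.
`setAssay<-` <- function(x, value) {
  rn <- rownames(value)
  if (!is.null(rn) && !identical(rn, x$rn))
    warning("rownames on the assay are dropped; set them on the container")
  x$assay <- unname(value)
  x
}

toy <- newToy(matrix(0, nrow = 3, ncol = 2))
m <- getAssay(toy)
rownames(m) <- as.character(1:3)
setAssay(toy) <- m        # warns instead of dropping the names silently
rownames(getAssay(toy))   # the container's row names are unchanged
```

The point is only that the setter can compare the incoming names with what the container holds and warn when they would be lost.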

>   2) I still strongly believe we should support pData, sampleNames etc etc
> on SummarizedExperiments.
>

worthy of discussion


>   3) Having developed a package (minfi) where eSets coexist with
> SummarizedExperiment, I have to mention that for the developer there are a
> number of places where the different internals of these two classes make
> life irritating.  For this reason I would support a "modern" implementation
> of eSet, in parallel with SummarizedExperiment.
>
>
also worthy of further discussion IMHO


> Best,
> Kasper
>
> On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain
> <vobencha at fredhutch.org> wrote:
>
> > Hi Mike,
> >
> > Our error - we didn't bump GenomicRanges when rowRanges was added.
> > Hopefully 1.19.43 will propagate today and things will be sorted out.
> >
> > Val
> >
> >
> > On 03/06/2015 07:40 AM, Michael Love wrote:
> >
> >> hi all,
> >>
> >> just a practical issue: I have GenomicRanges version 1.19.42 on my
> >> computer which does not have rowRanges defined, although the 1.19.42
> >> version on the Bioc website does have rowRanges in the man page:
> >>
> >>
> >> http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html
> >>
> >> So I pass check locally but not in the devel branch on Bioc servers.
> >>
> >> > library(GenomicRanges)
> >> > rowRanges
> >> Error: object 'rowRanges' not found
> >>
> >> > sessionInfo()
> >> R Under development (unstable) (2014-12-08 r67137)
> >> Platform: x86_64-apple-darwin12.5.0 (64-bit)
> >>
> >> locale:
> >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>
> >> attached base packages:
> >> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
> >>
> >> other attached packages:
> >> [1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
> >> [5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
> >> [9] BiocInstaller_1.17.5
> >>
> >>
> >>
> >> On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan <mtmorgan at fredhutch.org>
> >> wrote:
> >>
> >>>
> >>> On 03/04/2015 10:03 AM, Peter Haverty wrote:
> >>>
> >>>>
> >>>> Michael has a good point. The complexity of the BioC universe of classes
> >>>> hurts our ability to attract new users. More classes would be a minus
> >>>> there ... but a small set of common, explicit APIs would simplify things.
> >>>> Rectangular things implement the matrix Interface.  :-) Deprecating old
> >>>> stuff, like eSet, might help more than it hurts, on the simplicity
> >>>> front.
> >>>>
> >>>> P.S. apropos of understanding this universe of classes, I *love* the
> >>>> methods(class=x) thing Vincent mentioned.
> >>>>
> >>>
> >>>
> >>> The current version, under R-devel, is at
> >>>
> >>>    devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")
> >>>
> >>>    > methods(class="SummarizedExperiment")
> >>>     [1] [                 [[                [[<-              [<-
> >>>     [5] $                 $<-               assay             assay<-
> >>>     [9] assayNames        assayNames<-      assays            assays<-
> >>>    [13] cbind             coerce            colData           colData<-
> >>>    [17] compare           Compare           countOverlaps     coverage
> >>>    [21] dim               dimnames          dimnames<-        disjointBins
> >>>    [25] distance          distanceToNearest duplicated        elementMetadata
> >>>    [29] elementMetadata<- end               end<-             exptData
> >>>    [33] exptData<-        extractROWS       findOverlaps      flank
> >>>    [37] follow            granges           isDisjoint        mcols
> >>>    [41] mcols<-           narrow            nearest           order
> >>>    [45] overlapsAny       precede           ranges            ranges<-
> >>>    [49] rank              rbind             replaceROWS       resize
> >>>    [53] restrict          rowData           rowData<-         seqinfo
> >>>    [57] seqinfo<-         seqnames          shift             show
> >>>    [61] sort              split             start             start<-
> >>>    [65] strand            strand<-          subset            subsetByOverlaps
> >>>    [69] updateObject      values            values<-          width
> >>>    [73] width<-
> >>>
> >>>    see ?"methods" for accessing help and source code
> >>>
> >>> and
> >>>
> >>>    > head(attr(methods(class="SummarizedExperiment"), "info"))
> >>>                                                              generic visible
> >>> [,SummarizedExperiment,ANY-method                                  [    TRUE
> >>> [[,SummarizedExperiment,ANY,missing-method                        [[    TRUE
> >>> [[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE
> >>> [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE
> >>> $,SummarizedExperiment-method                                      $    TRUE
> >>> $<-,SummarizedExperiment-method                                  $<-    TRUE
> >>>                                                              isS4          from
> >>> [,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
> >>> [[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
> >>> [[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
> >>> [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
> >>> $,SummarizedExperiment-method                                TRUE GenomicRanges
> >>> $<-,SummarizedExperiment-method                              TRUE GenomicRanges
> >>> Martin
> >>>
> >>>
> >>>> Pete
> >>>>
> >>>> ____________________
> >>>> Peter M. Haverty, Ph.D.
> >>>> Genentech, Inc.
> >>>> phaverty at gene.com
> >>>>
> >>>> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
> >>>> lawrence.michael at gene.com>
> >>>> wrote:
> >>>>
> >>>>> I think we need to make sure that there are enough benefits of something
> >>>>> like GRangesFrame before we introduce yet another complicated and
> >>>>> overlapping data structure into the framework. Prior to summarization,
> >>>>> the ranges seem primary, after summarization, it may often make sense for
> >>>>> them to be secondary. But I'm just not sure what we gain from a new data
> >>>>> structure.
> >>>>>
> >>>>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org>
> >>>>> wrote:
> >>>>>
> >>>>>> GRangesFrame is an interesting idea and I gave it some thought.
> >>>>>>
> >>>>>> There is this nice symmetry between GRanges and GRangesFrame:
> >>>>>>
> >>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
> >>>>>>
> >>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
> >>>>>>                    some accessor (e.g. rowRanges())
> >>>>>>
> >>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
> >>>>>> can hold, but different in terms of API: the former has the ranges
> >>>>>> API as primary API and the DataFrame API on its mcols() component,
> >>>>>> and the latter has the DataFrame API as primary API and the ranges
> >>>>>> API on its rowRanges() component. Nice switch!
> >>>>>>
> >>>>>> What does this API switch bring us? A GRangesFrame object is now
> >>>>>> an object that fully behaves like a DataFrame and people can also
> >>>>>> perform range-based operations on its rowRanges() component.
> >>>>>> Here is what I'm afraid is going to happen: people will also want
> >>>>>> to be able to perform range-based operations *directly* on
> >>>>>> these objects, i.e. without having to call rowRanges() first.
> >>>>>> So for example when they do subsetByOverlaps(), subsetting
> >>>>>> happens vertically. Also the Hits object returned by findOverlaps()
> >>>>>> would contain row indices. Problem with this is that these objects
> >>>>>> now start to suffer from the "dual personality syndrome". For
> >>>>>> example, it's not clear anymore what their length should be.
> >>>>>> Strictly speaking it should be their number of columns (that's
> >>>>>> what the length of a DataFrame is), but the ranges API that
> >>>>>> we're trying to put on them also makes them feel like vectors
> >>>>>> along the vertical dimension so it also feels that their length
> >>>>>> should be their number of rows. Same thing with 1D subsetting.
> >>>>>> Why does it subset the columns and not the rows? Most people
> >>>>>> are now confused.
> >>>>>>
> >>>>>> It's interesting to note that the same thing happens with GRanges
> >>>>>> objects, but in the opposite direction: people wish they could
> >>>>>> do DataFrame operations directly on them without calling mcols()
> >>>>>> first. But in order to preserve the good health of GRanges objects,
> >>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
> >>>>>> the pressure was just too strong).
> >>>>>>
> >>>>>> H.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
> >>>>>>
> >>>>>>> Should be possible for the annotations to be of any type, as long as
> >>>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
> >>>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
> >>>>>>> to have a special class for the container with range information. The
> >>>>>>> contract for the range annotation would be to have a granges() method.
> >>>>>>>
> >>>>>>> I agree it would be nice if there was a way with the methods package
> >>>>>>> to easily assert such contracts. For example, one could define an
> >>>>>>> interface with a set of generics (and optionally the relevant position
> >>>>>>> in the generic signature). Then, once all of the methods have been
> >>>>>>> assigned for a particular class, it is made to inherit from that
> >>>>>>> contract class. There are lots of gotchas though. Not sure how useful
> >>>>>>> it would be in practice.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <
> >>>>>>> haverty.peter at gene.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> There are some nice similarities in these new imaginary types. A
> >>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns)
> >>>>>>>> and some row meta-data (the GRanges).  The SE-like object is similarly
> >>>>>>>> a list of dimensionally like things (matrices, RleDataFrames, BigMatrix
> >>>>>>>> objects, HDF5-backed things) with some row meta-data (a DataFrame or
> >>>>>>>> GRangesFrame). Elegant?  Maybe they would actually be relatives in the
> >>>>>>>> class tree.
> >>>>>>>>
> >>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
> >>>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
> >>>>>>>> implements this set of methods ...
> >>>>>>>>
> >>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
> >>>>>>>> become an extension to this new SE-like thing.  The extra stuff that
> >>>>>>>> comes along with genoset will still be available.
> >>>>>>>>
> >>>>>>>> Pete
> >>>>>>>>
> >>>>>>>> ____________________
> >>>>>>>> Peter M. Haverty, Ph.D.
> >>>>>>>> Genentech, Inc.
> >>>>>>>> phaverty at gene.com
> >>>>>>>>
> >>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <
> >>>>>>>> tim.triche at gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> This.
> >>>>>>>>>
> >>>>>>>>> It would be damned near perfect as a return value for assays coming
> >>>>>>>>> out of an object that held several such assays at several time points
> >>>>>>>>> in a population, where there are both assay-wise and covariate-wise
> >>>>>>>>> "holes" that could nonetheless be usefully imputed across assays.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Statistics is the grammar of science.
> >>>>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <
> >>>>>>>>> haverty.peter at gene.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
> >>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
> >>>>>>>>>>>> argument.
> >>>>>>>>>>>
> >>>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
> >>>>>>>>>>> conflicting APIs.
> >>>>>>>>>>
> >>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
> >>>>>>>>>> (without mcols) as an index?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>            [[alternative HTML version deleted]]
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Bioc-devel at r-project.org mailing list
> >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>>>>>>
> >>>>>> --
> >>>>>> Hervé Pagès
> >>>>>>
> >>>>>> Program in Computational Biology
> >>>>>> Division of Public Health Sciences
> >>>>>> Fred Hutchinson Cancer Research Center
> >>>>>> 1100 Fairview Ave. N, M1-B514
> >>>>>> P.O. Box 19024
> >>>>>> Seattle, WA 98109-1024
> >>>>>>
> >>>>>> E-mail: hpages at fredhutch.org
> >>>>>> Phone:  (206) 667-5791
> >>>>>> Fax:    (206) 667-1319
> >>>>>>
> >>> --
> >>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>> 1100 Fairview Ave. N.
> >>> PO Box 19024 Seattle, WA 98109
> >>>
> >>> Location: Arnold Building M1 B861
> >>> Phone: (206) 667-2793
> >>>
> > --
> > Computational Biology / Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, Seattle, WA 98109
> >
> > Email: vobencha at fredhutch.org
> > Phone: (206) 667-3158
> >