[Bioc-devel] Changes to the SummarizedExperiment Class

Peter Haverty haverty.peter at gene.com
Wed Mar 4 19:28:06 CET 2015

Clarification: I meant the complexity of the full BioC class universe, not the
SE/eSet part. GenomicRanges, GRanges, GRangesList, RangesView,
RangesViewsList, ... I think all of that intimidates new people.  Maybe
that's not generally the case.  Sorry, I've taken this thread way off
topic.  I'll stop now.


Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Mar 4, 2015 at 10:08 AM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

> What complexity?  The Nature Methods paper laid it out: for most people,
> most of the time, use an SE.
> That way, the organization of metadata and covariates is enforced for you,
> like an ExpressionSet (another winning data structure) but without its
> baggage.
> Maybe the "Summarized" in the name isn't such a bad idea after all.
>  "AfterTheDataMungingIsDone" doesn't have the same ring to it.
> What would be equally awesome IMHO is to have a similarly unifying
> structure for integrative work.
> But that's just, like, my opinion.  I've taken a whack at it when I knew
> even less than I do now, and it's hard.  However, data management for
> expression arrays was hard, too.  If I'm not mistaken, there were benefits
> to solving that data management problem, too.  Some sort of a software
> project.  I think it was called "MADMAN".  I'll have to go look.  ;-)
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty <haverty.peter at gene.com>
> wrote:
>>  Michael has a good point. The complexity of the BioC universe of
>> classes hurts our ability to attract new users. More classes would be a
>> minus there ... but a small set of common, explicit APIs would simplify
>> things.  Rectangular things implement the matrix Interface.  :-)
>> Deprecating old stuff, like eSet, might help more than it hurts, on the
>> simplicity front.
>>  P.S. apropos of understanding this universe of classes, I *love* the
>> methods(class=x) thing Vincent mentioned.
>>  Pete
>> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
>> lawrence.michael at gene.com> wrote:
>>> I think we need to make sure that there are enough benefits of something
>>> like GRangesFrame before we introduce yet another complicated and
>>> overlapping data structure into the framework. Prior to summarization, the
>>> ranges seem primary, after summarization, it may often make sense for them
>>> to be secondary. But I'm just not sure what we gain from a new data
>>> structure.
>>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org>
>>> wrote:
>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>                  some accessor (e.g. rowRanges())
>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>> can hold, but different in terms of API: the former has the ranges
>>>> API as primary API and the DataFrame API on its mcols() component,
>>>> and the latter has the DataFrame API as primary API and the ranges
>>>> API on its rowRanges() component. Nice switch!
>>>> What does this API switch bring us? A GRangesFrame object is now
>>>> an object that fully behaves like a DataFrame and people can also
>>>> perform range-based operations on its rowRanges() component.
>>>> Here is what I'm afraid is going to happen: people will also want
>>>> to be able to perform range-based operations *directly* on
>>>> these objects, i.e. without having to call rowRanges() first.
>>>> So for example when they do subsetByOverlaps(), subsetting
>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>> would contain row indices. Problem with this is that these objects
>>>> now start to suffer from the "dual personality syndrome". For
>>>> example, it's not clear anymore what their length should be.
>>>> Strictly speaking it should be their number of columns (that's
>>>> what the length of a DataFrame is), but the ranges API that
>>>> we're trying to put on them also makes them feel like vectors
>>>> along the vertical dimension so it also feels that their length
>>>> should be their number of rows. Same thing with 1D subsetting.
>>>> Why does it subset the columns and not the rows? Most people
>>>> are now confused.
>>>> It's interesting to note that the same thing happens with GRanges
>>>> objects, but in the opposite direction: people wish they could
>>>> do DataFrame operations directly on them without calling mcols()
>>>> first. But in order to preserve the good health of GRanges objects,
>>>> we've not done that (except for $, a shortcut for mcols(x)$;
>>>> the pressure was just too strong).
>>>> H.
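
The "dual personality" problem described above can be sketched in a few
lines. This is only a language-neutral illustration in Python, not R and not
any Bioconductor API: the classes Ranges and RangesFrame and the methods
overlaps(), nrow() and subset_rows() are hypothetical stand-ins for a naked
GRanges, a GRangesFrame, findOverlaps(), NROW() and row subsetting. It shows
that a frame-like object with a ranges index has two defensible answers to
"what is its length?".

```python
from dataclasses import dataclass


@dataclass
class Ranges:
    """Hypothetical stand-in for a 'naked' GRanges: parallel start/end vectors."""
    starts: list
    ends: list

    def __len__(self):
        # Vector-like personality: the length is the number of ranges.
        return len(self.starts)

    def overlaps(self, start, end):
        """Return indices of ranges overlapping the closed interval [start, end]."""
        return [i for i, (s, e) in enumerate(zip(self.starts, self.ends))
                if s <= end and e >= start]


@dataclass
class RangesFrame:
    """Hypothetical 'GRangesFrame': a table of columns with ranges as row index."""
    columns: dict        # column name -> list of values, one per row
    row_ranges: Ranges   # one range per row

    def __len__(self):
        # Frame-like personality: the length is the number of columns.
        return len(self.columns)

    def nrow(self):
        return len(self.row_ranges)

    def subset_rows(self, idx):
        """Subset 'vertically', keeping columns and ranges in sync."""
        cols = {k: [v[i] for i in idx] for k, v in self.columns.items()}
        rng = Ranges([self.row_ranges.starts[i] for i in idx],
                     [self.row_ranges.ends[i] for i in idx])
        return RangesFrame(cols, rng)


rf = RangesFrame(
    columns={"score": [0.1, 0.5, 0.9], "gene": ["a", "b", "c"]},
    row_ranges=Ranges(starts=[1, 10, 20], ends=[5, 15, 25]),
)

# Range operations go through the accessor and return row indices.
hits = rf.row_ranges.overlaps(4, 12)
sub = rf.subset_rows(hits)

print(len(rf), rf.nrow(), len(rf.row_ranges))  # 2 3 3
print(hits, sub.nrow())                        # [0, 1] 2
```

As long as range operations are reached through the accessor, len(rf) can
stay the column count. Once overlaps() is put directly on RangesFrame, users
would reasonably expect len(rf) to be 3 rather than 2, which is the confusion
described above.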
>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>> Should be possible for the annotations to be of any type, as long as
>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
>>>>> to have a special class for the container with range information. The
>>>>> contract for the range annotation would be to have a granges() method.
>>>>> I agree it would be nice if there was a way with the methods package to
>>>>> easily assert such contracts. For example, one could define an
>>>>> interface with a set of generics (and optionally the relevant position
>>>>> in the generic signature). Then, once all of the methods have been
>>>>> assigned for a particular class, it is made to inherit from that
>>>>> contract class. There are lots of gotchas though. Not sure how useful
>>>>> it would be in practice.
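
The contract idea above (a class counts as an annotation once it has the
required methods) can be sketched with structural typing. This is an
illustration in Python, not the R methods package: RowAnnotation,
RangeAnnotation, PlainTable and RangedTable are hypothetical names, and
nrow(), subset() and granges() merely stand in for NROW(), 2D "[" and
granges(). Python's typing.Protocol with runtime_checkable only checks that
the methods exist, not their signatures or behaviour, which is one of the
gotchas such a scheme would have.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RowAnnotation(Protocol):
    """Contract: anything with a row count and 2D-style subsetting."""
    def nrow(self) -> int: ...
    def subset(self, rows, cols=None): ...


@runtime_checkable
class RangeAnnotation(RowAnnotation, Protocol):
    """Stricter contract: a row annotation that can also report its ranges."""
    def granges(self): ...


class PlainTable:
    """Hypothetical annotation: a list of row dicts, no range information."""
    def __init__(self, rows):
        self.rows = list(rows)

    def nrow(self):
        return len(self.rows)

    def subset(self, rows, cols=None):
        # Column selection is ignored in this sketch; only rows are subset.
        return type(self)([self.rows[i] for i in rows])


class RangedTable(PlainTable):
    """Hypothetical annotation whose rows carry 'start' and 'end' fields."""
    def granges(self):
        return [(r["start"], r["end"]) for r in self.rows]


plain = PlainTable([{"gene": "a"}, {"gene": "b"}])
ranged = RangedTable([{"start": 1, "end": 5}, {"start": 10, "end": 15}])

# Neither class names the contracts; satisfying them is purely structural.
print(isinstance(plain, RowAnnotation))     # True
print(isinstance(plain, RangeAnnotation))   # False: no granges()
print(isinstance(ranged, RangeAnnotation))  # True
print(isinstance([1, 2], RowAnnotation))    # False: a list has no nrow()
```

A container could then accept any RowAnnotation for its row metadata and
enable range-based operations only when the annotation also satisfies
RangeAnnotation.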
>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
>>>>> wrote:
>>>>>> There are some nice similarities in these new imaginary types.  A
>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns)
>>>>>> and some row meta-data (the GRanges).  The SE-like object is similarly
>>>>>> a list of dimensionally like things (matrices, RleDataFrames,
>>>>>> BigMatrix objects, HDF5-backed things) with some row meta-data (a
>>>>>> DataFrame or GRangesFrame).
>>>>>> Elegant?  Maybe they would actually be relatives in the class tree.
>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>>>> implements this set of methods ...
>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
>>>>>> become an extension to this new SE-like thing.  The extra stuff that
>>>>>> comes along with genoset will still be available.
>>>>>> Pete
>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr.
>>>>>> <tim.triche at gmail.com> wrote:
>>>>>>> This.
>>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>>> out of an object that held several such assays at several time points
>>>>>>> in a population, where there are both assay-wise and covariate-wise
>>>>>>> "holes" that could nonetheless be usefully imputed across assays.
>>>>>>> Statistics is the grammar of science.
>>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty
>>>>>>> <haverty.peter at gene.com> wrote:
>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>> argument.
>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>> conflicting APIs.
>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>> (without mcols) as an index?
>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>   --
>>>> Hervé Pagès
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>> E-mail: hpages at fredhutch.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
