[Bioc-devel] SummarizedExperiment vs ExpressionSet

Michael Lawrence lawrence.michael at gene.com
Fri Dec 5 20:31:23 CET 2014


Sounds good. One note: if range information becomes optional, it would be
nice if we could mark the availability of the information in the class
hierarchy. Otherwise, it's not easy to enforce a contract (that we can call
range-based methods on an SE) through dispatch. An alternative would be to
drop direct range-based accessors and operations from SummarizedExperiment,
although that potentially puts more burden on the user.
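
To make the first option concrete, here is a minimal sketch in plain S4.
The class names (SimpleSE, RangedSE), the slot names and the rowWidths()
generic are all made up for illustration and are not an existing or proposed
Bioconductor API; a data.frame with start/end columns stands in for a real
GRanges. The point is only that a range-based method defined on the ranged
subclass lets dispatch enforce the contract.

```r
library(methods)

# Hypothetical class names, for illustration only; not the Bioconductor API.
# Base container: an assay matrix plus per-row annotation, no coordinates.
setClass("SimpleSE",
         representation(assay = "matrix", rowAnnotation = "data.frame"))

# Subclass that additionally guarantees range information for every row.
# Ranges are stood in for by a plain data.frame with start/end columns.
setClass("RangedSE",
         contains = "SimpleSE",
         representation(rowCoords = "data.frame"),
         validity = function(object) {
             if (nrow(object@rowCoords) != nrow(object@assay))
                 return("need one range per assay row")
             TRUE
         })

# A range-based operation is defined only for the ranged subclass, so
# dispatch itself enforces the contract.
setGeneric("rowWidths", function(x) standardGeneric("rowWidths"))
setMethod("rowWidths", "RangedSE",
          function(x) x@rowCoords$end - x@rowCoords$start + 1L)

m <- matrix(rnorm(6), nrow = 3, dimnames = list(letters[1:3], NULL))
plain <- new("SimpleSE", assay = m,
             rowAnnotation = data.frame(id = letters[1:3]))
ranged <- new("RangedSE", assay = m,
              rowAnnotation = data.frame(id = letters[1:3]),
              rowCoords = data.frame(start = c(1L, 11L, 21L),
                                     end = c(5L, 20L, 40L)))

rowWidths(ranged)        # works: ranges are guaranteed by the class
is(ranged, "SimpleSE")   # the ranged object is still usable as the base class
# Calling rowWidths(plain) fails at dispatch: no method for "SimpleSE".
no_method <- inherits(try(rowWidths(plain), silent = TRUE), "try-error")
```

With this layout, code that needs ranges can declare "RangedSE" in its
method signature, while code that only needs assays and annotation can
accept the base class.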

On Mon, Dec 1, 2014 at 10:30 AM, Martin Morgan <mtmorgan at fredhutch.org>
wrote:

> On 11/26/2014 12:11 PM, Hervé Pagès wrote:
>
>> Hi guys,
>>
>> I like the idea of separating the row data from the row ranges.
>> This could be formalized with 2 distinct accessors: rowData() and
>> rowRanges(). The former would return a DataFrame, and the latter
>> NULL or a range-based object (GRanges or GRangesList).
>> I don't think there is the need for an emptyRanges class.
>>
>
> For the original question, I think the ability to store genomic
> coordinates as well as other 'S4Vector' classes is very helpful for
> advanced users, even if a little intimidating for novice users.
>
> Also, it's clear that SummarizedExperiment in its current form doesn't
> satisfy the common use case of identifiers without range information.
>
> I think it makes sense to enable something like what Hervé outlines above,
> where the rowData() are separated into range information and annotation
> information, and I'll move forward with that implementation over the next
> week or so.
>
> Martin
>
>
>
>> H.
>>
>> On 11/26/2014 11:40 AM, Hector Corrada Bravo wrote:
>>
>>> One thing that’s become apparent working on epivizr is that it may be
>>> useful to think about ‘rowData’ in a SummarizedExperiment as having two
>>> distinct components: row coordinates and row metadata. In the current
>>> class rowData is a ‘GenomicRanges’ which contains both coordinates (the
>>> ranges) and metadata (mcols(rowData)). In metagenomics (the other
>>> application my group works a lot with), we think of the taxonomy as
>>> providing coordinates. The distinction is worthwhile thinking about
>>> since there are certain operations we do on coordinates that we don’t do
>>> with metadata (and conversely).
>>>
>>> Thinking about it this way, the ‘ExpressionSet’ object would be data
>>> without coordinates. So, I would avoid making ‘GenomicRanges’ behave
>>> like ‘DataFrame’ since this distinction between coordinates and metadata
>>> is lost. The ‘emptyRanges’ proposal gets closer to this since this
>>> corresponds to ‘no coordinates’, but it may be worth thinking in the
>>> long term on making the coordinate/metadata distinction more general.
>>>
>>> Hector
>>>
>>> On Wed, Nov 26, 2014 at 12:38 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>> wrote:
>>>
>>>> So as a simple experiment, I did the following:
>>>>
>>>> library(GenomicRanges)
>>>> bar <- matrix(rnorm(100), ncol=10)
>>>> colnames(bar) <- as.character(1:10)
>>>> rownames(bar) <- letters[1:10]
>>>> foo <- SummarizedExperiment(assays=list(bar=bar))
>>>> rowData(foo)
>>>> ## GRangesList object of length 10:
>>>> ## $a
>>>> ## GRanges object with 0 ranges and 0 metadata columns:
>>>> ##    seqnames    ranges strand
>>>> ##       <Rle> <IRanges>  <Rle>
>>>> ##
>>>> ## $b
>>>> ## GRanges object with 0 ranges and 0 metadata columns:
>>>> ##      seqnames ranges strand
>>>> ##
>>>> ## $c
>>>> ## GRanges object with 0 ranges and 0 metadata columns:
>>>> ##      seqnames ranges strand
>>>> ##
>>>> ## ...
>>>> ## <7 more elements>
>>>> colData(foo)
>>>> ## DataFrame with 10 rows and 0 columns
>>>>
>>>> This got me to thinking, why not have an emptyRanges class, or else the
>>>> ability to index a bunch of NULL ranges without a lot of hoohah?  The
>>>> defaults mostly do what they're supposed to; why not have a compact
>>>> representation of empty rowData as for empty colData (i.e., a DataFrame
>>>> with 0 columns)?  Or is a GRangesList of empty GRanges as compact as it
>>>> is practicable to get for this purpose?
>>>>
>>>> Just pondering what the lowest-impact solution to the problem at hand
>>>> might be.
>>>>
>>>> Statistics is the grammar of science.
>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>
>>>> On Wed, Nov 26, 2014 at 9:07 AM, Peter Haverty <haverty.peter at gene.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I believe there is a strong need for an object that organizes a
>>>>> collection of rectangular data (matrices, etc.) with metadata on the
>>>>> rows and columns.  Can SummarizedExperiment inherit from something
>>>>> simpler that has a DataFrame as rowData?  (I believe GenomicRanges
>>>>> should inherit from DataTable, rather than Vector, and subset as
>>>>> x[i,j], but maybe that's getting a bit off topic.)  I often see people
>>>>> stuffing arbitrary data into an ExpressionSet and calling one of the
>>>>> assays "exprs" as a work-around.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Pete
>>>>>
>>>>> ____________________
>>>>> Peter M. Haverty, Ph.D.
>>>>> Genentech, Inc.
>>>>> phaverty at gene.com
>>>>>
>>>>> On Wed, Nov 26, 2014 at 7:19 AM, Laurent Gatto <lg390 at cam.ac.uk>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> On 26 November 2014 14:59, Wolfgang Huber wrote:
>>>>>>
>>>>>>> A colleague and I are designing a package for quantitative proteomics
>>>>>>> data, and we are debating whether to base it on the
>>>>>>> SummarizedExperiment or the ExpressionSet class.
>>>>>>>
>>>>>>> There is no immediate use for the ranges aspect of
>>>>>>> SummarizedExperiment, so that would have to be carried around with
>>>>>>> NAs, and this is a parsimony argument for using ExpressionSet
>>>>>>> instead. OTOH, the interface of SummarizedExperiment is cleaner, its
>>>>>>> code more modern and more likely to be updated, and users of the
>>>>>>> Bioconductor project are likely to benefit from having to deal with a
>>>>>>> single interface that works the same or similarly across packages,
>>>>>>> rather than a variety of formats; which argues that new packages
>>>>>>> should converge towards SummarizedExperiment('s interface).
>>>>>>>
>>>>>>> Are there any pertinent insights from this group?
>>>>>>>
>>>>>>
>>>>>> Instead of ExpressionSet, you could use MSnbase::MSnSet, which is
>>>>>> essentially an ExpressionSet for quantitative proteomics (i.e., it has
>>>>>> a MIAPE slot instead of a MIAME one, for example).
>>>>>>
>>>>>> Ideally, a SummarizedExperiment for proteomics would use
>>>>>> peptide/protein ranges, which is in the pipeline, as far as I am
>>>>>> concerned. When that becomes available, there should be infrastructure
>>>>>> to coerce an MSnSet (and/or other relevant data) into a
>>>>>> SummarizedExperiment.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Laurent
>>>>>>
>>>>>>> Thanks and best wishes
>>>>>>> Wolfgang
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Laurent Gatto
>>>>>> http://cpu.sysbiol.cam.ac.uk/
>>>>>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
>
