[Bioc-devel] SummarizedExperiment vs ExpressionSet

Hervé Pagès hpages at fredhutch.org
Wed Nov 26 21:11:23 CET 2014


Hi guys,

I like the idea of separating the row data from the row ranges.
This could be formalized with 2 distinct accessors: rowData() and
rowRanges(). The former would return a DataFrame, and the latter
NULL or a range-based object (GRanges or GRangesList).
I don't think there is a need for an emptyRanges class.
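
To make the two-accessor idea concrete, here is a rough sketch in base R only. Everything in it is made up for illustration: ToyExperiment is a hypothetical class, not SummarizedExperiment; a plain data.frame stands in for DataFrame; and a data.frame of coordinates stands in for GRanges/GRangesList. The generics are defined locally, so they would mask any same-named generics from attached packages.

```r
library(methods)

## Hypothetical stand-in for "NULL or a range-based object".
setClassUnion("ToyRangesOrNULL", c("data.frame", "NULL"))

## Toy container: row metadata and row coordinates live in separate slots.
setClass("ToyExperiment",
    slots = c(
        assays    = "list",
        rowData   = "data.frame",      # row metadata, always present
        rowRanges = "ToyRangesOrNULL"  # row coordinates, may be absent
    )
)

setGeneric("rowData", function(x, ...) standardGeneric("rowData"))
setGeneric("rowRanges", function(x, ...) standardGeneric("rowRanges"))

setMethod("rowData", "ToyExperiment", function(x, ...) x@rowData)
setMethod("rowRanges", "ToyExperiment", function(x, ...) x@rowRanges)

m <- matrix(rnorm(100), ncol = 10,
            dimnames = list(letters[1:10], as.character(1:10)))

## Data without coordinates (the ExpressionSet-like case).
toy <- new("ToyExperiment",
    assays    = list(m = m),
    rowData   = data.frame(score = seq_len(10), row.names = rownames(m)),
    rowRanges = NULL
)

## Same data with coordinates attached.
toy_ranged <- new("ToyExperiment",
    assays    = list(m = m),
    rowData   = rowData(toy),
    rowRanges = data.frame(seqnames = "chr1",
                           start = seq(1, 901, by = 100),
                           end = seq(100, 1000, by = 100),
                           row.names = rownames(m))
)

rowData(toy)          # a data.frame with one row per feature
rowRanges(toy)        # NULL
rowRanges(toy_ranged) # the coordinates, separate from the metadata
```

With this layout the "no coordinates" case is just a NULL in the rowRanges slot, so no placeholder object has to be carried around.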

H.

On 11/26/2014 11:40 AM, Hector Corrada Bravo wrote:
> One thing that’s become apparent working on epivizr is that it may be useful to think about ‘rowData’ in a SummarizedExperiment as having two distinct components: row coordinates and row metadata. In the current class, rowData is a ‘GenomicRanges’ which contains both coordinates (the ranges) and metadata (mcols(rowData)). In metagenomics (the other application my group works with a lot), we think of the taxonomy as providing coordinates. The distinction is worth thinking about since there are certain operations we do on coordinates that we don’t do on metadata (and conversely).
>
> Thinking about it this way, the ‘ExpressionSet’ object would be data without coordinates. So, I would avoid making ‘GenomicRanges’ behave like ‘DataFrame’, since the distinction between coordinates and metadata would then be lost. The ‘emptyRanges’ proposal gets closer to this, since it corresponds to ‘no coordinates’, but it may be worth thinking in the long term about making the coordinate/metadata distinction more general.
>
> Hector
>
> On Wed, Nov 26, 2014 at 12:38 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> wrote:
>
>> so as a simple experiment, I did the following:
>> library(GenomicRanges)
>> bar <- matrix(rnorm(100), ncol=10)
>> colnames(bar) <- as.character(1:10)
>> rownames(bar) <- letters[1:10]
>> foo <- SummarizedExperiment(assays=list(bar=bar))
>> rowData(foo)
>> ## GRangesList object of length 10:
>> ## $a
>> ## GRanges object with 0 ranges and 0 metadata columns:
>> ##    seqnames    ranges strand
>> ##       <Rle> <IRanges>  <Rle>
>> ##
>> ## $b
>> ## GRanges object with 0 ranges and 0 metadata columns:
>> ##      seqnames ranges strand
>> ##
>> ## $c
>> ## GRanges object with 0 ranges and 0 metadata columns:
>> ##      seqnames ranges strand
>> ##
>> ## ...
>> ## <7 more elements>
>> colData(foo)
>> ## DataFrame with 10 rows and 0 columns
>>
>> This got me thinking, why not have an emptyRanges class, or else the
>> ability to index a bunch of NULL ranges without a lot of hoohah?  The
>> defaults mostly do what they're supposed to; why not have a compact
>> representation of empty rowData as for empty colData (i.e., a DataFrame
>> with 0 columns)?  Or is a GRangesList of empty GRanges as compact as it is
>> practicable to get for this purpose?
>>
>> Just pondering what the lowest-impact solution to the problem at hand might
>> be.
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Wed, Nov 26, 2014 at 9:07 AM, Peter Haverty <haverty.peter at gene.com>
>> wrote:
>>> Hi all,
>>>
>>> I believe there is a strong need for an object that organizes a collection
>>> of rectangular data (matrices, etc.) with metadata on the rows and
>>> columns.  Can SummarizedExperiment inherit from something simpler that has
>>> a DataFrame as rowData?  (I believe GenomicRanges should inherit from
>>> DataTable, rather than Vector, and subset as x[i,j], but maybe that's
>>> getting a bit off topic.)  I often see people stuffing arbitrary data into
>>> an ExpressionSet and calling one of the assays "exprs" as a work-around.
>>>
>>> Regards,
>>>
>>> Pete
>>>
>>> ____________________
>>> Peter M. Haverty, Ph.D.
>>> Genentech, Inc.
>>> phaverty at gene.com
>>>
>>> On Wed, Nov 26, 2014 at 7:19 AM, Laurent Gatto <lg390 at cam.ac.uk> wrote:
>>>
>>>>
>>>> On 26 November 2014 14:59, Wolfgang Huber wrote:
>>>>
>>>>> A colleague and I are designing a package for quantitative proteomics
>>>>> data, and we are debating whether to base it on the
>>>>> SummarizedExperiment or the ExpressionSet class.
>>>>>
>>>>> There is no immediate use for the ranges aspect of
>>>>> SummarizedExperiment, so that would have to be carried around with
>>>>> NAs, and this is a parsimony argument for using ExpressionSet
>>>>> instead. OTOH, the interface of SummarizedExperiment is cleaner, its
>>>>> code more modern and more likely to be updated, and users of the
>>>>> Bioconductor project are likely to benefit from having to deal with a
>>>>> single interface that works the same or similarly across packages,
>>>>> rather than a variety of formats; which argues that new packages
>>>>> should converge towards SummarizedExperiment('s interface).
>>>>>
>>>>> Are there any pertinent insights from this group?
>>>>
>>>> Instead of ExpressionSet, you could use MSnbase::MSnSet, which is
>>>> essentially an ExpressionSet for quantitative proteomics (i.e., it has a
>>>> MIAPE slot instead of MIAME, for example).
>>>>
>>>> Ideally, a SummarizedExperiment for proteomics would use peptide/protein
>>>> ranges, which is in the pipeline, as far as I am concerned. When that
>>>> becomes available, there should be infrastructure to coerce an MSnSet
>>>> (and/or other relevant data) into a SummarizedExperiment.
>>>>
>>>> Hope this helps.
>>>>
>>>> Best wishes,
>>>>
>>>> Laurent
>>>>
>>>>> Thanks and best wishes
>>>>> Wolfgang
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>> --
>>>> Laurent Gatto
>>>> http://cpu.sysbiol.cam.ac.uk/
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


