[Bioc-devel] Changes to the SummarizedExperiment Class

Martin Morgan mtmorgan at fredhutch.org
Mon Mar 9 17:22:10 CET 2015


On 03/09/2015 07:36 AM, Kasper Daniel Hansen wrote:
> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>> I am glad you are keeping this discussion alive Kasper.
>>
>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>
>>> It sounds like the proposed changes are already made.  However (like
>>> others) I am still a bit mystified why this was necessary.  The old
>>> version did allow for a GRanges inside the DataFrame of the rowData, as
>>> far as I recall.  So I assume this is for efficiency.  But why?  What
>>> kind of data/use cases is this for?

Actually the design has GRanges on the 'outside'; a DataFrame can be emulated 
with GRangesList of 0 elements, with mcols() the DataFrame, but this is 
obviously a hack. Simon Anders argued perhaps 5 years ago for DataFrame on the 
'outside', allowing for GRanges on the inside; maybe that would have been a 
better original design, but I guess I was stuck on the defining 
characteristic of sequencing experiments being range-based, the expressive power 
of reliably overlapping say ranges of differentially expressed genes with ranges 
of variants or ChIP binding sites, and a desire not to introduce a plethora 
(e.g., 2) of classes.

You can think of what has been done so far as simply renaming the accessor, from 
a bland rowData() to more meaningful rowRanges(). This did come about from 
discussion of community input (starting 
https://stat.ethz.ch/pipermail/bioc-devel/2014-November/006686.html), just 
perhaps not consistent with all opinions expressed. We felt it was important to 
get this first step done 'this release', because it frees us to do more 
substantial refactoring immediately after the coming release while allowing 
rowData() a chance to cycle out of existence.

The exact nature of the refactoring implementation is still not decided, but the 
conceptual ideas are to enable a SummarizedExperiment (sub)class that does not 
require a GRanges* rowData, while retaining a SummarizedExperiment (sub)class 
that is based on GRanges rowData / rowRanges.
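
To make the conceptual idea concrete, here is a minimal sketch in base R (the 
'methods' package only). It is not the planned implementation: the class names 
SketchSE and SketchRangedSE, the generic nfeatures(), and the use of a plain 
data.frame as a stand-in for both DataFrame and GRanges are all assumptions 
made for illustration.

```r
library(methods)

# Hypothetical base class: row annotations are a plain data.frame
# (standing in for a DataFrame); no genomic ranges are required.
setClass("SketchSE",
    representation(assays = "list",
                   rowData = "data.frame",
                   colData = "data.frame"))

# Hypothetical range-based subclass: adds one range per row. A data.frame
# with chr/start/end columns stands in for a GRanges here.
setClass("SketchRangedSE", contains = "SketchSE",
    representation(rowRanges = "data.frame"))

# A method written once for the base class is inherited by the subclass.
setGeneric("nfeatures", function(x) standardGeneric("nfeatures"))
setMethod("nfeatures", "SketchSE", function(x) nrow(x@rowData))

plain <- new("SketchSE",
    assays = list(counts = matrix(0, nrow = 3, ncol = 2)),
    rowData = data.frame(id = c("g1", "g2", "g3")),
    colData = data.frame(sample = c("A", "B")))

# Build the ranged object from the plain one plus the extra slot.
ranged <- new("SketchRangedSE", plain,
    rowRanges = data.frame(chr = c("chr1", "chr1", "chr2"),
                           start = c(1L, 100L, 50L),
                           end = c(10L, 200L, 80L)))

nfeatures(plain)
nfeatures(ranged)
is(ranged, "SketchSE")
```

The point of the sketch is only that code written against the range-free class 
keeps working on the range-based one, while range-specific operations would be 
defined on the subclass alone.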

>>>
>>> I am happy to hear that SummarizedExperiment is going to be spun out into
>>> its own package.  When that happens, I have some comments, which I'll
>>> include here in anticipation
>>>    1) I now very strongly believe it was a design mistake to not have
>>> colnames on the assays.  The advantage of this choice is that sampleNames
>>> are only stored one place.  The extreme disadvantage is the high
>>> inefficiency when you want colnames on an extracted assay.
>>>
>>
>> after example(SummarizedExperiment)
>>
>>> colnames(assays(se1)[[1]])
>> [1] "A" "B" "C" "D" "E" "F"
>>
>> so this seems to be optional.  But attempts to set rownames will fail
>> silently
>>
>>> rownames(assays(se1)[[1]]) = as.character(1:200)
>>
>>> rownames(assays(se1)[[1]])
>>
>> NULL
>> seems we could issue a warning there

The rownames issue seems to be a bug; setting row and column names on the 
object itself is sufficient:

   > colnames(se1) = tolower(colnames(se1))
   > colnames(se1)
   [1] "a" "b" "c" "d" "e" "f"
   > rownames(se1) = 1:200
   > head(rownames(se1))
   [1] "1" "2" "3" "4" "5" "6"

>>
>
>
> Vince, you need to be careful here.
>
> The assays are stored without colnames (unless something has recently
> changed).  The default is to - upon extraction - set the colnames of the
> matrix.  This however requires a copy of the entire matrix.  So
> essentially, upon extraction, each assay is needlessly duplicated to add
> the colnames.  This is what I mean by inefficient. I would prefer to store

Yes, this is certainly a bad design decision, and it will be corrected.

> the assays with colnames.  This means that changing sampleNames of the
> object will be inefficient (as it is for eSets) since it would require a
> complete copy of everything.  But I would rather - much rather - copy when
> setting sampleNames than copy when extracting an assay.
>
> Best,
> Kasper
>
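
The copy-on-extraction cost discussed above can be illustrated with a minimal 
base R sketch. It does not use SummarizedExperiment itself; the extractor 
extract_assay() and the variable names are hypothetical, and the matrix is 
kept tiny only for readability.

```r
# Assay as stored inside the container: no dimnames (an assumption that
# mirrors the design described in this thread).
stored <- matrix(as.numeric(1:12), nrow = 2, ncol = 6)
sample_names <- c("A", "B", "C", "D", "E", "F")

# Hypothetical extractor: attaches the sample names on the way out.
extract_assay <- function(x, nms) {
    colnames(x) <- nms   # changes the local value; 'stored' is untouched
    x
}

extracted <- extract_assay(stored, sample_names)

is.null(colnames(stored))   # the stored matrix still has no colnames
colnames(extracted)         # the extracted matrix carries the names
```

Because dimnames are an attribute of the matrix object, and the stored matrix 
must remain without them, the extracted matrix is a separate object with its 
own copy of the data. For a large assay that duplication happens on every 
extraction, whereas storing the assays with colnames would move the copy to 
the (rarer) operation of renaming samples.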


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


