[Bioc-devel] Changes to the SummarizedExperiment Class
mtmorgan at fredhutch.org
Mon Mar 9 17:22:10 CET 2015
On 03/09/2015 07:36 AM, Kasper Daniel Hansen wrote:
> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <stvjc at channing.harvard.edu>
>> I am glad you are keeping this discussion alive Kasper.
>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>> It sounds like the proposed changes are already made. However (like
>>> others) I am still a bit mystified why this was necessary. The old
>>> did allow for a GRanges inside the DataFrame of the rowData, as far as I
>>> recall. So I assume this is for efficiency. But why? What kind of
>>> data/use cases is this for?
Actually the design has GRanges on the 'outside'; a DataFrame can be emulated
with GRangesList of 0 elements, with mcols() the DataFrame, but this is
obviously a hack. Simon Anders argued perhaps 5 years ago for DataFrame on the
'outside', allowing for GRanges on the inside; maybe that would have been a
better original design, but I guess I was stuck on the defining
characteristic of sequencing experiments being range-based, the expressive power
of reliably overlapping say ranges of differentially expressed genes with ranges
of variants or ChIP binding sites, and a desire not to introduce a plethora
(e.g., 2) of classes.
You can think of what has been done so far as simply renaming the accessor, from
a bland rowData() to more meaningful rowRanges(). This did come about from
discussion of community input (starting [...]),
perhaps not consistent with all opinions expressed. We felt it was important to
get this first step done 'this release', because it frees us to do more
substantial refactoring immediately after the coming release while allowing
rowData() a chance to cycle out of existence.
The exact nature of the refactoring implementation is still not decided, but the
conceptual ideas are to enable a SummarizedExperiment (sub)class that does not
require a GRanges* rowData, while retaining a SummarizedExperiment (sub)class
that is based on GRanges rowData / rowRanges.
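To make the range-based design concrete, here is a minimal sketch of a SummarizedExperiment whose rows are described by a GRanges, followed by an overlap query against a second set of ranges. It assumes a current Bioconductor installation in which the class lives in its own SummarizedExperiment package; the counts, gene coordinates, variant positions and the gene_id column are all made up for illustration.

```r
library(GenomicRanges)
library(SummarizedExperiment)

# Made-up count matrix: 4 features (rows) by 3 samples (columns).
counts <- matrix(1:12, nrow = 4, ncol = 3)

# Hypothetical gene ranges; gene_id is an illustrative metadata column.
genes <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(100, 500, 900, 1500), width = 200),
    gene_id = paste0("gene", 1:4)
)

se <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = genes,
    colData = DataFrame(row.names = c("A", "B", "C"))
)

# The row annotation is a GRanges, retrieved with rowRanges().
rowRanges(se)

# Hypothetical variant positions; keep only the rows whose ranges overlap them.
variants <- GRanges("chr1", IRanges(start = c(150, 950), width = 1))
hits <- subsetByOverlaps(se, variants)
mcols(rowRanges(hits))$gene_id
```

Because the ranges sit on the 'outside', the overlap operation subsets the assay rows and the row annotation together in one step.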
>>> I am happy to hear that SummarizedExperiment is going to be spun out into
>>> its own package. When that happens, I have some comments, which I'll
>>> include here in anticipation
>>> 1) I now very strongly believe it was a design mistake to not have
>>> colnames on the assays. The advantage of this choice is that sampleNames
>>> are only stored one place. The extreme disadvantage is the high
>>> inefficiency when you want colnames on an extracted assay.
>> after example(SummarizedExperiment)
>> [1] "A" "B" "C" "D" "E" "F"
>> so this seems to be optional. But attempts to set rownames will fail
>>> rownames(assays(se1)[[1]]) = as.character(1:200)
>> seems we could issue a warning there
the rownames issue seems to be a bug; simply accessing row and colnames on the
object itself is sufficient
> colnames(se1) = tolower(colnames(se1))
[1] "a" "b" "c" "d" "e" "f"
> rownames(se1) = 1:200
[1] "1" "2" "3" "4" "5" "6"
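The point about setting names on the object itself can be sketched as follows. This is not the se1 object from example(SummarizedExperiment); it is a small stand-in built from a random matrix, and it assumes a current Bioconductor installation with the standalone SummarizedExperiment package.

```r
library(SummarizedExperiment)

# Stand-in object: 200 features by 6 samples, assay stored without dimnames.
m <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
se <- SummarizedExperiment(
    assays = list(exprs = m),
    colData = DataFrame(row.names = LETTERS[1:6])
)

# Set the names on the container rather than on an individual assay.
colnames(se) <- tolower(colnames(se))
rownames(se) <- as.character(1:200)

# The extracted assay carries the names held by the container.
head(colnames(assay(se)))
head(rownames(assay(se)))
```

Names set this way apply to every assay in the object, so there is no need to assign dimnames assay by assay.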
> Vince, you need to be careful here.
> The assays are stored without colnames (unless something has recently
> changed). The default is to - upon extraction - set the colnames of the
> matrix. This however requires a copy of the entire matrix. So
> essentially, upon extraction, each assay is needlessly duplicated to add
> the colnames. This is what I mean by inefficient. I would prefer to store
yes this is certainly a bad design decision, and will be corrected.
> the assays with colnames. This means that changing sampleNames of the
> object will be inefficient (as it is for eSets) since it would require a
> complete copy of everything. But I would rather - much rather - copy when
> setting sampleNames than copy when extracting an assay.
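The trade-off described above can be illustrated with plain matrices in base R. This is only a sketch of the copying behaviour, not the SummarizedExperiment implementation; extract_assay() is a hypothetical helper standing in for "attach sample names on extraction", and the tracemem() step only runs if R was built with memory profiling.

```r
# A stored assay without column names, plus the sample names kept elsewhere.
m <- matrix(0, nrow = 200, ncol = 6)
sample_names <- LETTERS[1:6]

# Strategy 1: names live outside the matrix and are attached on extraction.
# Assigning colnames modifies the matrix, so R duplicates it on every call.
extract_assay <- function(x, nms) {
    colnames(x) <- nms
    x
}
named <- extract_assay(m, sample_names)
is.null(colnames(m))   # the stored matrix itself is unchanged

# Strategy 2: store the matrix with colnames. The cost is paid once, when
# the names are set; extraction is then a plain reference, which R does not
# copy unless it is later modified.
stored <- m
colnames(stored) <- sample_names
extracted <- stored

# Optional: ask R to report the duplication made by strategy 1.
if (capabilities("profmem")) {
    tracemem(m)
    invisible(extract_assay(m, sample_names))
    untracemem(m)
}
```

With strategy 1 the duplication is repeated for every extraction, which is the inefficiency raised in the thread; with strategy 2 the copy moves to the (rarer) operation of renaming samples.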
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793