[Bioc-devel] Changes to the SummarizedExperiment Class

Michael Love michaelisaiahlove at gmail.com
Mon Mar 9 16:07:41 CET 2015


Some guidance on how to avoid duplication of the matrix for developers
would be greatly appreciated.

Another trouble point: if I am given an SE with an unnamed assay and
need to give the assay a name, this can also expand the memory used.
I had found a solution (which works with GenomicRanges 1.18 / current
release):

names(assays(se, withDimnames=FALSE))[1] <- "foo"

But in devel this no longer appears to work. The memory used expands,
just as it does with:

names(assays(se))[1] <- "foo"

Here's some code to try this:

library(GenomicRanges)  # provides SummarizedExperiment in 1.18 / current devel
m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
se <- SummarizedExperiment(m)
names(assays(se, withDimnames=FALSE))[1] <- "foo"
names(assays(se))[1] <- "foo"

Run gc() between steps to see the memory used after each one.
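
The gc() bookkeeping can be sketched in base R alone. This is only an
illustration, not the SummarizedExperiment code path: used_mb() is a
hypothetical helper (not part of any package) that sums the "(Mb)" used
column of gc(), and a plain matrix without dimnames stands in for a
stored assay.

```r
# Hypothetical helper: Mb currently in use (Ncells + Vcells),
# measured right after a full garbage collection.
used_mb <- function() sum(gc()[, 2])

m <- matrix(1:1e7, ncol = 10)        # integer matrix, no dimnames
before <- used_mb()

m2 <- m                              # no copy yet: m2 shares m's memory
shared <- used_mb()

colnames(m2) <- as.character(1:10)   # adding dimnames duplicates the data
after <- used_mb()

shared - before   # close to 0
after - shared    # roughly the size of the matrix
```

The same before/after pattern can be wrapped around each of the
names(assays(...)) lines above to compare the two forms.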


On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>> I am glad you are keeping this discussion alive Kasper.
>>
>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>
>>> It sounds like the proposed changes are already made.  However (like
>>> others) I am still a bit mystified why this was necessary.  The old
>>> version did allow for a GRanges inside the DataFrame of the rowData,
>>> as far as I recall.  So I assume this is for efficiency.  But why?
>>> What kind of data/use cases is this for?
>>>
>>> I am happy to hear that SummarizedExperiment is going to be spun out into
>>> its own package.  When that happens, I have some comments, which I'll
>>> include here in anticipation:
>>>   1) I now very strongly believe it was a design mistake to not have
>>> colnames on the assays.  The advantage of this choice is that sampleNames
>>> are stored in only one place.  The extreme disadvantage is the high
>>> inefficiency when you want colnames on an extracted assay.
>>>
>>
>> after example(SummarizedExperiment)
>>
>> > colnames(assays(se1)[[1]])
>> [1] "A" "B" "C" "D" "E" "F"
>>
>> so this seems to be optional.  But attempts to set rownames will fail
>> silently:
>>
>> > rownames(assays(se1)[[1]]) = as.character(1:200)
>>
>> > rownames(assays(se1)[[1]])
>>
>> NULL
>>
>> It seems we could issue a warning there.
>>
>
>
> Vince, you need to be careful here.
>
> The assays are stored without colnames (unless something has recently
> changed).  The default is to - upon extraction - set the colnames of the
> matrix.  This however requires a copy of the entire matrix.  So
> essentially, upon extraction, each assay is needlessly duplicated to add
> the colnames.  This is what I mean by inefficient. I would prefer to store
> the assays with colnames.  This means that changing sampleNames of the
> object will be inefficient (as it is for eSets) since it would require a
> complete copy of everything.  But I would rather - much rather - copy when
> setting sampleNames than copy when extracting an assay.
>
> Best,
> Kasper
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel