[Bioc-devel] Changes to the SummarizedExperiment Class

Martin Morgan mtmorgan at fredhutch.org
Mon Mar 9 17:36:00 CET 2015


On 03/09/2015 08:07 AM, Michael Love wrote:
> Some guidance on how to avoid duplication of the matrix for developers
> would be greatly appreciated.

It's unsatisfactory, but extracting assays with withDimnames=FALSE avoids the 
duplication (at the cost, obviously, of having no dimnames on the matrix). Row 
or column subsetting necessarily duplicates the subsetted assay data. There 
should be no duplication when rowRanges() or colData() are changed without 
changing their dimensions or ordering.
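
A rough sketch of how one might check these three cases, in the spirit of the 
gc()-between-steps approach quoted below. It assumes SummarizedExperiment() is 
loaded from a package of the same name (with GenomicRanges 1.18, load 
GenomicRanges instead); the names a0, a1 and a2 are only illustrative, and the 
numbers gc() reports will differ between versions, so none are shown here.

```r
## Assumption: the class lives in the SummarizedExperiment package;
## with GenomicRanges 1.18 use library(GenomicRanges) instead.
library(SummarizedExperiment)

m <- matrix(1:1e7, ncol=10, dimnames=list(1:1e6, 1:10))
se <- SummarizedExperiment(m)

## 1. Extract the assay as stored, without attaching the object's dimnames.
gc(reset=TRUE)    # reset the "max used" columns
a0 <- assays(se, withDimnames=FALSE)[[1]]
gc()              # note "max used" for this step

## 2. Default extraction: dimnames of 'se' are applied to the matrix.
gc(reset=TRUE)
a1 <- assays(se)[[1]]
gc()              # compare "max used" with step 1

## 3. Row subsetting: the subsetted assay data is necessarily new data.
gc(reset=TRUE)
a2 <- assays(se[1:5e5, ], withDimnames=FALSE)[[1]]
gc()
```

Comparing the "max used" columns across the three steps should indicate which 
extractions allocate a full copy of the matrix.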

> Another trouble point is that if I am given an SE with an unnamed
> assay and I need to give the assay a name, this can also expand the
> memory used. I had found a solution (which works with
> GenomicRanges 1.18 / current release) with:
>
> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>
> But now I'm looking in devel and this appears to no longer work. The
> memory used expands, equivalent to:
>
> names(assays(se))[1] <- "foo"
>
> Here's some code to try this:
>
> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
> se <- SummarizedExperiment(m)
> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> names(assays(se))[1] <- "foo"
>
> while running gc() in between steps.

I think this is a regression of some sort, and I'll look into it. Thanks for the 
heads-up.

Martin

>
>
> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <stvjc at channing.harvard.edu>
>> wrote:
>>
>>> I am glad you are keeping this discussion alive Kasper.
>>>
>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> It sounds like the proposed changes are already made.  However (like
>>>> others) I am still a bit mystified why this was necessary.  The old
>>>> version did allow for a GRanges inside the DataFrame of the rowData, as
>>>> far as I recall.  So I assume this is for efficiency.  But why?  What
>>>> kind of data/use cases is this for?
>>>>
>>>> I am happy to hear that SummarizedExperiment is going to be spun out into
>>>> its own package.  When that happens, I have some comments, which I'll
>>>> include here in anticipation:
>>>>    1) I now very strongly believe it was a design mistake to not have
>>>> colnames on the assays.  The advantage of this choice is that sampleNames
>>>> are only stored in one place.  The extreme disadvantage is the high
>>>> inefficiency when you want colnames on an extracted assay.
>>>>
>>>
>>> after example(SummarizedExperiment)
>>>
>>> > colnames(assays(se1)[[1]])
>>> [1] "A" "B" "C" "D" "E" "F"
>>>
>>> so this seems to be optional.  But attempts to set rownames will fail
>>> silently
>>>
>>> > rownames(assays(se1)[[1]]) = as.character(1:200)
>>> > rownames(assays(se1)[[1]])
>>> NULL
>>>
>>> It seems we could issue a warning there.
>>>
>>
>>
>> Vince, you need to be careful here.
>>
>> The assays are stored without colnames (unless something has recently
>> changed).  The default is to - upon extraction - set the colnames of the
>> matrix.  This however requires a copy of the entire matrix.  So
>> essentially, upon extraction, each assay is needlessly duplicated to add
>> the colnames.  This is what I mean by inefficient. I would prefer to store
>> the assays with colnames.  This means that changing sampleNames of the
>> object will be inefficient (as it is for eSets) since it would require a
>> complete copy of everything.  But I would rather - much rather - copy when
>> setting sampleNames than copy when extracting an assay.
>>
>> Best,
>> Kasper
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


