[Bioc-devel] Changes to the SummarizedExperiment Class

Michael Love michaelisaiahlove at gmail.com
Wed Apr 1 21:59:04 CEST 2015


Yes, you're right! Sorry for the noise. I forgot this was how it
always behaved. All I had to do was change the argument name.
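
For anyone finding this in the archive, here is a minimal sketch of the
constructor call with the renamed argument. It assumes the standalone
SummarizedExperiment package (at the time of this thread the class still
lived in GenomicRanges), and the toy counts, sample labels and ranges are
made up for illustration.

```r
# Assumption: SummarizedExperiment is installed as its own package and
# attaches GenomicRanges / S4Vectors (GRanges, IRanges, DataFrame).
library(SummarizedExperiment)

# Toy data invented for this sketch: 5 features x 4 samples.
counts <- matrix(1:20, ncol = 4,
                 dimnames = list(letters[1:5], LETTERS[1:4]))
colData <- DataFrame(condition = c("ctl", "ctl", "trt", "trt"),
                     row.names = LETTERS[1:4])
rr <- GRanges("chr1", IRanges(start = seq(100, 500, by = 100), width = 50))
names(rr) <- letters[1:5]

# Only the argument name differs from the old rowData= call.
se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = rr, colData = colData)

# Accessor that replaces the old rowData() for the ranges.
rowRanges(se)
```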

On Wed, Apr 1, 2015 at 3:51 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi Michael,
>
> On 04/01/2015 07:17 AM, Michael Love wrote:
>>
>> I'll retract those last two emails about empty GRanges. That's simply:
>>
>> se <- SummarizedExperiment(assays, colData=colData)
>> mcols(se) <- myDataFrame
>
>
> Glad you found a simple way to do what you wanted.
>
> More below...
>
>>
>> On Tue, Mar 31, 2015 at 4:40 PM, Michael Love
>> <michaelisaiahlove at gmail.com> wrote:
>>>
>>> Would this code inspired by the release version of GenomicRanges work?
>>> e.g. if I want to add a DataFrame with 10 rows:
>>>
>>> names <- letters[1:10]
>>> x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
>>> mcols(x) <- DataFrame(foo=1:10)
>>>
>>> Then give x to the rowRanges argument of SummarizedExperiment?
>>>
>>> On Tue, Mar 31, 2015 at 3:49 PM, Michael Love
>>> <michaelisaiahlove at gmail.com> wrote:
>>>>
>>>> I forgot to ask my other question. I had gone in early March and fixed
>>>> my code to eliminate rowData<-, but the argument to SummarizedExperiment
>>>> was still called rowData, and a DataFrame could be provided. Then I
>>>> didn't check for a few weeks, but the argument for the rowData slot is
>>>> now called rowRanges. What's the trick to putting a DataFrame on an
>>>> empty GRanges, so I can get the old behavior but now using the rowRanges
>>>> argument?
>
>
> I'm not sure what you meant by "so I can get the old behavior but
> now using the rowRanges argument".
>
> Just to clarify: the renaming of rowData to rowRanges is a change
> of name only, not a change of behavior. More precisely the new
> rowRanges() accessor should behave exactly as the old rowData()
> accessor. The same applies to the 'rowRanges' argument of the
> SummarizedExperiment() constructor. So whatever you were passing
> before to the 'rowData' argument, you should still be able to pass
> it to the new 'rowRanges' argument. Please let us know if it's not
> the case as this is certainly not intended.
>
> Thanks,
> H.
>
>
>>>>
>>>> On Tue, Mar 31, 2015 at 3:40 PM, Michael Love
>>>> <michaelisaiahlove at gmail.com> wrote:
>>>>>
>>>>> With GenomicRanges 1.19.48, I'm still having issues with re-naming the
>>>>> first assay and duplication of memory from my March 9 email. I tried
>>>>> assayNames<- as well. My use case is if I am given a
>>>>> SummarizedExperiment where the first element is not named "counts"
>>>>> (albeit the SE is most likely coming from summarizeOverlaps() and
>>>>> already named "counts"...).
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R Under development (unstable) (2015-03-31 r68129)
>>>>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>>>> Running under: OS X 10.8.5 (Mountain Lion)
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats4    parallel  stats     graphics  grDevices datasets  utils
>>>>> [8] methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
>>>>> [5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
>>>>> [9] BiocInstaller_1.17.6
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5
>>>>>
>>>>> On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
>>>>> <michaelisaiahlove at gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmorgan at fredhutch.org>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 03/09/2015 08:07 AM, Michael Love wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Some guidance on how to avoid duplication of the matrix for developers
>>>>>>>> would be greatly appreciated.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It's unsatisfactory, but using withDimnames=FALSE avoids duplication
>>>>>>> on extraction of assays (but obviously you don't have dimnames on the
>>>>>>> matrix). Row or column subsetting necessarily causes the subsetted assay
>>>>>>> data to be duplicated. There should not be any duplication when rowRanges()
>>>>>>> or colData() are changed without changing their dimension / ordering.
>>>>>>>
>>>>>>
>>>>>> Thanks Martin for checking into the regression.
>>>>>>
>>>>>> Sorry, I should have been more specific earlier: I meant more
>>>>>> guidance/documentation in the man page for SE. I scanned the 'Extension'
>>>>>> section but didn't find a note on withDimnames for extracting the matrix or
>>>>>> this example of renaming the assays (it seems like this could easily be
>>>>>> relevant for other package authors).
>>>>>>
>>>>>> A prominent note there might help devs write more memory efficient
>>>>>> packages.
>>>>>>
>>>>>> The argument section mentions speed but I'd explicitly mention memory
>>>>>> given that we're often storing big matrices:
>>>>>>
>>>>>> "Setting withDimnames=FALSE  increases the speed with which assays are
>>>>>> extracted."
>>>>>>
>>>>>> (It's entirely possible the info is there but I missed it.)
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>>
>>>>>>>> Another example of a trouble point is that if I am given an SE with
>>>>>>>> an unnamed assay and I need to give the assay a name, this also can
>>>>>>>> expand the memory used. I had found a solution (which works with
>>>>>>>> GenomicRanges 1.18 / current release) with:
>>>>>>>>
>>>>>>>> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>>>>>>>>
>>>>>>>> But now I'm looking in devel and this appears to no longer work. The
>>>>>>>> memory used expands, equivalent to:
>>>>>>>>
>>>>>>>> names(assays(se))[1] <- "foo"
>>>>>>>>
>>>>>>>> Here's some code to try this:
>>>>>>>>
>>>>>>>> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
>>>>>>>> se <- SummarizedExperiment(m)
>>>>>>>> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>>>>>>>> names(assays(se))[1] <- "foo"
>>>>>>>>
>>>>>>>> while running gc() in between steps.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think this is a regression of some sort, and I'll look into it.
>>>>>>> Thanks for the heads-up.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
>>>>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey
>>>>>>>>> <stvjc at channing.harvard.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am glad you are keeping this discussion alive Kasper.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>>>>>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> It sounds like the proposed changes are already made.  However (like
>>>>>>>>>>> others) I am still a bit mystified why this was necessary.  The old
>>>>>>>>>>> version did allow for a GRanges inside the DataFrame of the rowData,
>>>>>>>>>>> as far as I recall.  So I assume this is for efficiency.  But why?
>>>>>>>>>>> What kind of data/use cases is this for?
>>>>>>>>>>>
>>>>>>>>>>> I am happy to hear that SummarizedExperiment is going to be spun out
>>>>>>>>>>> into its own package.  When that happens, I have some comments, which
>>>>>>>>>>> I'll include here in anticipation:
>>>>>>>>>>>     1) I now very strongly believe it was a design mistake to not have
>>>>>>>>>>> colnames on the assays.  The advantage of this choice is that
>>>>>>>>>>> sampleNames are only stored in one place.  The extreme disadvantage is
>>>>>>>>>>> the high inefficiency when you want colnames on an extracted assay.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> after example(SummarizedExperiment)
>>>>>>>>>>
>>>>>>>>>>> colnames(assays(se1)[[1]])
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] "A" "B" "C" "D" "E" "F"
>>>>>>>>>>
>>>>>>>>>> so this seems to be optional.  But attempts to set rownames will fail
>>>>>>>>>> silently
>>>>>>>>>>
>>>>>>>>>>> rownames(assays(se1)[[1]]) = as.character(1:200)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> rownames(assays(se1)[[1]])
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> NULL
>>>>>>>>>> seems we could issue a warning there
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Vince, you need to be careful here.
>>>>>>>>>
>>>>>>>>> The assays are stored without colnames (unless something has recently
>>>>>>>>> changed).  The default is to - upon extraction - set the colnames of
>>>>>>>>> the matrix.  This however requires a copy of the entire matrix.  So
>>>>>>>>> essentially, upon extraction, each assay is needlessly duplicated to
>>>>>>>>> add the colnames.  This is what I mean by inefficient. I would prefer
>>>>>>>>> to store the assays with colnames.  This means that changing
>>>>>>>>> sampleNames of the object will be inefficient (as it is for eSets)
>>>>>>>>> since it would require a complete copy of everything.  But I would
>>>>>>>>> rather - much rather - copy when setting sampleNames than copy when
>>>>>>>>> extracting an assay.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kasper
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
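
On the withDimnames point quoted above, here is a minimal sketch of naming
an assay and then extracting it without dimnames attached. It assumes the
standalone SummarizedExperiment package with its assay() and assayNames<-
functions; the matrix and sample names are made up. Whether a copy is
actually avoided depends on the package version, so it is worth checking
with gc() or tracemem() on real data.

```r
# Assumption: the standalone SummarizedExperiment package is installed.
library(SummarizedExperiment)

# Made-up example: an unnamed assay without dimnames, 10 named samples.
m <- matrix(1:1e6, ncol = 10)
se <- SummarizedExperiment(assays = list(m),
                           colData = DataFrame(row.names = LETTERS[1:10]))

# Give the first (unnamed) assay a name.
assayNames(se) <- "counts"

# Extract without attaching the object's dimnames to the matrix.
raw <- assay(se, "counts", withDimnames = FALSE)

# Default extraction attaches the sample names as colnames.
named <- assay(se, "counts")

colnames(raw)
colnames(named)
```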


