[Bioc-devel] SummarizedExperiment with alternate back end
Peter Haverty
haverty.peter at gene.com
Sat Sep 19 03:18:50 CEST 2015
While we are on the topic, my GenoSet class will become a subclass of
RangedSummarizedExperiment, rather than eSet, after this upcoming release.
For this release, both APIs work (colnames and sampleNames, etc.).
I think the range-free SummarizedExperiment will be great. I've seen a lot
of ExpressionSets with random, non-exprs stuff in the exprs slot for lack
of something more appropriate.
Pete
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On Fri, Sep 18, 2015 at 6:09 PM, Ryan <rct at thompsonclan.org> wrote:
> In the dev version, SummarizedExperiment has been split into
> RangedSummarizedExperiment (equivalent to the current
> SummarizedExperiment, with rowRanges) and SummarizedExperiment (kind of
> like eSet, no rowRanges). Given that eSet objects also support multiple
> assayData elements, I believe the new SummarizedExperiment is pretty close
> to being eSet with different method names. In fact, I wonder if eSet
> could/should be reimplemented as a subclass of the new SummarizedExperiment
> class.
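
A minimal sketch of the split described above, assuming a current release of the SummarizedExperiment package is installed. The toy matrix, the feature and sample names, and the chr1 ranges are made up for illustration. It shows that the same constructor yields the range-free class when no rowRanges are given and the ranged subclass when they are.

```r
library(SummarizedExperiment)  # attaches GenomicRanges and IRanges as dependencies

# Toy assay: 4 features x 3 samples (made-up values and names)
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("f", 1:4), paste0("s", 1:3)))

# No rowRanges supplied: a range-free, eSet-like container
se <- SummarizedExperiment(assays = list(counts = counts))

# Hypothetical genomic ranges, one per feature, named like the assay rows
rr <- GRanges("chr1", IRanges(start = c(1, 101, 201, 301), width = 50))
names(rr) <- rownames(counts)

# rowRanges supplied: a RangedSummarizedExperiment
rse <- SummarizedExperiment(assays = list(counts = counts), rowRanges = rr)

class(se)
class(rse)
is(rse, "SummarizedExperiment")  # the ranged class extends the range-free one
```

Both objects answer to the same accessors (assays(), colData(), colnames()); only the ranged one carries rowRanges().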
>
>
> On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
>
>> Interesting, thanks for the pointer.
>>
>> In light of the existing (and future) work on this, may I suggest an
>> eSet-like class, but built using the technologies in SummarizedExperiment,
>> i.e. a SummarizedExperiment without the rowRanges. I would very much like
>> this for modern work using eSet-like containers. Not everything has ranges.
>>
>> Vince: I am not claiming that it is easy to work with; we have pains as
>> well. But am I missing something or is the assay matrix only 2.3Gb?
>>
>> Best,
>> Kasper
>>
>> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty <haverty.peter at gene.com>
>> wrote:
>>
>>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are good tricks
>>> for reducing the size of your eSets and SummarizedExperiments. Both object
>>> types can go into assayData or assays. In fact, that's what they were
>>> designed for.
>>>
>>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data from
>>> Illumina SNP arrays. We typically have ~6 such rectangular objects in one
>>> eSet. With a mix of BigMatrix objects for point estimates and RleDataFrames
>>> for segmented data, readRDS times are quite reasonable.
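
To illustrate why run-length encoding suits segmented data, here is a small sketch using S4Vectors::Rle. It does not use RleDataFrame or BigMatrix themselves; the segment values and lengths are invented, and the only assumption is that the S4Vectors package is installed.

```r
library(S4Vectors)

# Hypothetical segmented values for one sample: three long constant runs,
# as one might get after copy-number segmentation
seg_means <- rep(c(0.0, 0.58, -1.0), times = c(50000, 30000, 20000))

# Run-length encode: store each distinct run once, with its length
seg_rle <- Rle(seg_means)

nrun(seg_rle)       # number of runs actually stored
length(seg_rle)     # logical length is unchanged
runValue(seg_rle)   # the per-segment values
runLength(seg_rle)  # how many positions each segment covers

# Compare in-memory footprint of the dense vector and the Rle
object.size(seg_means)
object.size(seg_rle)

# Decoding recovers the original vector
identical(as.vector(seg_rle), seg_means)
```

The saving depends entirely on how long the runs are; noisy point estimates with few repeated neighbours gain nothing from this encoding, which is consistent with using a different container for them.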
>>>
>>>
>>> Pete
>>>
>>> ____________________
>>> Peter M. Haverty, Ph.D.
>>> Genentech, Inc.
>>> phaverty at gene.com
>>>
>>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>> wrote:
>>>
>>>> bigmemoryExtras (Peter Haverty's extensions to bigMemory/bigMatrix) can be
>>>> handy for this, as it works well as a backend, especially if you go about
>>>> splitting by chromosome as for CNV segmentation, DMR finding, etc. It's
>>>> not as seamless as one might like, but it's the closest thing I've found.
>>>>
>>>> SciDB tries to implement a similar API, but for a distributed version of
>>>> this where the data itself is in a columnar database and served on demand.
>>>> I tried getting that up and running as a SummarizedExperiment backend, but
>>>> did not succeed. I have previously shoveled all of the TCGA 450k data into
>>>> one 7,000+ column bigMatrix which serializes to about 14GB on disk.
>>>>
>>>> If you have any replicates in your 700+ samples, it's a good idea to keep
>>>> their SNP calls in metadata(yourSE), although if you change names, the
>>>> change needs to propagate into the dependent metadata. This is why I
>>>> started monkeying around with linkedExperiments where those mappings are
>>>> enforced; it's becoming more of an issue with the TARGET pediatric AML
>>>> study, where there are numerous diagnosis-remission-relapse trios whose
>>>> identity I wish to verify periodically. The SNPs on the 450k array are
>>>> great for this purpose, but minfi doesn't really have a slot for them per
>>>> se, so they live in metadata().
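
A rough sketch of the metadata() pattern mentioned above, assuming a current SummarizedExperiment release. The beta values, SNP calls, and sample names are invented, and rename_samples() is a hypothetical helper written for this example, not part of any package (and not how linkedExperiments works).

```r
library(SummarizedExperiment)

# Made-up methylation values: 4 probes x 3 samples (a dx/remission/relapse trio)
betas <- matrix(seq(0.1, 0.9, length.out = 12), nrow = 4,
                dimnames = list(paste0("cg", 1:4), c("dx", "rem", "rel")))
se <- SummarizedExperiment(assays = list(betas = betas))

# Made-up SNP calls (0/1/2), columns named after the samples
snps <- matrix(c(0, 0, 0,
                 2, 2, 2), nrow = 2, byrow = TRUE,
               dimnames = list(c("rs1", "rs2"), colnames(se)))
metadata(se)$snps <- snps

# Hypothetical helper: rename samples and push the new names into the
# dependent SNP matrix, since metadata() is not updated automatically
rename_samples <- function(se, new_names) {
  snps <- metadata(se)$snps
  colnames(se) <- new_names
  colnames(snps) <- new_names
  metadata(se)$snps <- snps
  se
}

se2 <- rename_samples(se, c("pt1_dx", "pt1_rem", "pt1_rel"))

# Sanity check that the two sets of names still agree
identical(colnames(se2), colnames(metadata(se2)$snps))
```

Renaming with colnames(se) <- ... alone would leave the SNP matrix carrying the old names, which is the drift the paragraph above warns about.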
>>>>
>>>>
>>>> --t
>>>>
>>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey
>>>> <stvjc at channing.harvard.edu> wrote:
>>>>
>>>>> i am dealing with ~700 450k arrays
>>>>>
>>>>> they are derived from one study, so it makes sense to think of
>>>>> them holistically.
>>>>>
>>>>> both the load time and the memory consumption are not satisfactory.
>>>>>
>>>>> has anyone worked on an object type that implements the rangedSE API but
>>>>> has the assay data out of memory?
>>>>>
>>>>> > unix.time(load("wbmse.rda"))
>>>>>    user  system elapsed
>>>>>  30.131   2.396  61.036
>>>>> > object.size(wbmse)
>>>>> 124031032 bytes
>>>>> > dim(wbmse)
>>>>> [1] 485577    690
>>>>> > object.size(assays(wbmse))
>>>>> 2680430992 bytes
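
As a back-of-the-envelope check on the figures above, the following base R arithmetic assumes (this is not stated in the thread) that the assays hold a single dense double-precision matrix of the reported dimensions, at 8 bytes per value.

```r
# Dimensions reported by dim(wbmse)
n_probes  <- 485577
n_samples <- 690

# Assumption: one dense matrix of doubles, 8 bytes per element
bytes_per_double <- 8
dense_bytes <- n_probes * n_samples * bytes_per_double

# Size reported by object.size(assays(wbmse))
reported_bytes <- 2680430992

dense_bytes                   # raw payload of the matrix
reported_bytes - dense_bytes  # remainder: dimnames and other object overhead
dense_bytes / 1e9             # in GB (decimal)
dense_bytes / 2^30            # in GiB (binary)
```

Under that assumption the matrix payload accounts for nearly all of the reported size, so the memory cost is essentially the dense data itself rather than container overhead.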
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel