[Bioc-devel] SummarizedExperiment with alternate back end

Ryan rct at thompsonclan.org
Sat Sep 19 04:58:04 CEST 2015


For what it's worth, I've written a class, which I have creatively named
SubsettableListOfArrays, that takes the "subset everything together"
aspect of eSet and SummarizedExperiment and makes it as generic as
possible. It's basically like the (non-ranged) SummarizedExperiment,
except that everything, like the assays slot, can have multiple
elements, and you can also have 1-dimensional vectors associated with
rows and columns. The implementation is the most straightforward you can
imagine and is not at all optimized, but it works. The only contract
required to store things in it is that they be subsettable in the
appropriate way. As an example use case, you might use it to store a
SummarizedExperiment, then the DGEList that you create from it, then the
fit object from glmFit as row data, then the result table as another row
data object, and so on, until it holds an entire edgeR analysis, and
maybe DESeq2 and limma-voom analyses of the same data as well. I haven't
actually felt the need to do that yet, so at the moment it's mostly a
proof of concept; I'm not using it for anything.

If anyone's interested, you can get it here: 
http://mneme.homenet.org/~ryan/SubsettableListOfArrays.R
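
To make the "subsettable in the appropriate way" contract concrete, here
is a minimal base-R sketch of the idea. It is not the code at the URL
above: the function names (subset_first, subset_together) and the plain
list layout (arrays, rowdata, coldata) are made up for illustration, and
it only handles 2-dimensional arrays.

```r
# Hypothetical helper: subset an object along its first dimension,
# whether it is a plain vector or something with dims (matrix, data.frame).
subset_first <- function(obj, i) {
  if (is.null(dim(obj))) obj[i] else obj[i, , drop = FALSE]
}

# Hypothetical container layout: a list with three named lists.
#   arrays  - 2-d objects indexed by [row, column]
#   rowdata - objects whose first dimension matches the rows
#   coldata - objects whose first dimension matches the columns
# Subsetting applies the same row index i and column index j everywhere.
subset_together <- function(x, i, j) {
  list(
    arrays  = lapply(x$arrays,  function(a) a[i, j, drop = FALSE]),
    rowdata = lapply(x$rowdata, subset_first, i = i),
    coldata = lapply(x$coldata, subset_first, i = j)
  )
}

# Toy data: 4 genes x 3 samples.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
x <- list(
  arrays  = list(counts = counts, logcounts = log2(counts + 1)),
  rowdata = list(
    results = data.frame(logFC = c(0.5, -1, 2, 0),
                         row.names = rownames(counts)),
    keep    = c(TRUE, FALSE, TRUE, TRUE)
  ),
  coldata = list(group = c("a", "b", "a"))
)

# Keep genes 1 and 3 and samples 2 and 3 across every element at once.
y <- subset_together(x, i = c(1, 3), j = 2:3)
```

After the call, every element of y has been cut down consistently:
y$arrays$counts is 2 x 2, y$rowdata$results has rows g1 and g3, and
y$coldata$group holds the groups of samples s2 and s3.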

-Ryan

On 9/18/15 7:41 PM, Michael Lawrence wrote:
> While it's useful (and often necessary) to store the big matrices out 
> of core, it would be convenient to store the metadata (the other 
> components of the object) along with the matrices. Something along the 
> lines of HDF5, but we would want to keep things abstract. Other 
> options include GDS (for genotypes), and of course most any database.
>
> On Fri, Sep 18, 2015 at 6:18 PM, Peter Haverty <haverty.peter at gene.com>
> wrote:
>
>     While we are on the topic, my GenoSet class will become a subclass of
>     RangedSummarizedExperiment, rather than eSet, after this upcoming
>     release. For this release both APIs work (colnames and sampleNames,
>     etc.)
>
>     I think the range-free SummarizedExperiment will be great. I've seen a
>     lot of ExpressionSets with random, non-exprs stuff in the exprs slot
>     for lack of something more appropriate.
>
>     Pete
>
>     ____________________
>     Peter M. Haverty, Ph.D.
>     Genentech, Inc.
>     phaverty at gene.com
>
>     On Fri, Sep 18, 2015 at 6:09 PM, Ryan <rct at thompsonclan.org> wrote:
>
>     > In the dev version, SummarizedExperiment has been split into
>     > RangedSummarizedExperiment (equivalent to the current
>     > SummarizedExperiment, with rowRanges) and SummarizedExperiment (kind
>     > of like eSet, no rowRanges). Given that eSet objects also support
>     > multiple assayData elements, I believe the new SummarizedExperiment
>     > is pretty close to being eSet with different method names. In fact, I
>     > wonder if eSet could/should be reimplemented as a subclass of the new
>     > SummarizedExperiment class.
>     >
>     >
>     > On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
>     >
>     >> Interesting, thanks for the pointer.
>     >>
>     >> In light of the existing (and future) work on this, may I suggest an
>     >> eSet-like class, but built using the technologies in
>     >> SummarizedExperiment, i.e. a SummarizedExperiment without the
>     >> rowRanges. I would very much like this for modern work using
>     >> eSet-like containers. Not everything has ranges.
>     >>
>     >> Vince: I am not claiming that it is easy to work with; we have pains
>     >> as well.  But am I missing something or is the assay matrix only
>     >> 2.3Gb?
>     >>
>     >> Best,
>     >> Kasper
>     >>
>     >> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty
>     >> <haverty.peter at gene.com> wrote:
>     >>
>     >>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are good
>     >>> tricks for reducing the size of your eSets and
>     >>> SummarizedExperiments.  Both object types can go into assayData or
>     >>> assays. In fact, that's what they were designed for.
>     >>>
>     >>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data from
>     >>> Illumina SNP arrays.  We typically have ~6 such rectangular objects
>     >>> in one eSet.  With a mix of BigMatrix objects for point estimates and
>     >>> RleDataFrames for segmented data, readRDS times are quite reasonable.
>     >>>
>     >>>
>     >>> Pete
>     >>>
>     >>> ____________________
>     >>> Peter M. Haverty, Ph.D.
>     >>> Genentech, Inc.
>     >>> phaverty at gene.com
>     >>>
>     >>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr.
>     >>> <tim.triche at gmail.com> wrote:
>     >>>
>     >>>> bigmemoryExtras (Peter Haverty's extensions to bigMemory/bigMatrix)
>     >>>> can be handy for this, as it works well as a backend, especially if
>     >>>> you go about splitting by chromosome as for CNV segmentation, DMR
>     >>>> finding, etc.  It's not as seamless as one might like, but it's the
>     >>>> closest thing I've found.
>     >>>>
>     >>>> SciDb tries to implement a similar API, but for a distributed
>     >>>> version of this where the data itself is in a columnar database and
>     >>>> served on demand.  I tried getting that up and running as a
>     >>>> SummarizedExperiment backend, but did not succeed.  I have
>     >>>> previously shoveled all of the TCGA 450k data into one 7,000+ column
>     >>>> bigMatrix which serializes to about 14GB on disk.
>     >>>>
>     >>>> If you have any replicates in your 700+ samples, it's a good idea to
>     >>>> keep their SNP calls in metadata(yourSE), although if you change
>     >>>> names it needs to propagate into the dependent metadata.  This is
>     >>>> why I started monkeying around with linkedExperiments where those
>     >>>> mappings are enforced; it's becoming more of an issue with the
>     >>>> TARGET pediatric AML study, where there are numerous
>     >>>> diagnosis-remission-relapse trios whose identity I wish to verify
>     >>>> periodically.  The SNPs on the 450k array are great for this
>     >>>> purpose, but minfi doesn't really have a slot for them per se, so
>     >>>> they live in metadata().
>     >>>>
>     >>>>
>     >>>> --t
>     >>>>
>     >>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey
>     >>>> <stvjc at channing.harvard.edu> wrote:
>     >>>>
>     >>>>> i am dealing with ~700 450k arrays
>     >>>>>
>     >>>>> they are derived from one study, so it makes sense to think of
>     >>>>> them holistically.
>     >>>>>
>     >>>>> both the load time and the memory consumption are not satisfactory.
>     >>>>>
>     >>>>> has anyone worked on an object type that implements the rangedSE
>     >>>>> API but has the assay data out of memory?
>     >>>>>
>     >>>>> > unix.time(load("wbmse.rda"))
>     >>>>>    user  system elapsed
>     >>>>>  30.131   2.396  61.036
>     >>>>> > object.size(wbmse)
>     >>>>> 124031032 bytes
>     >>>>> > dim(wbmse)
>     >>>>> [1] 485577    690
>     >>>>> > object.size(assays(wbmse))
>     >>>>> 2680430992 bytes
>     >>>>>
>     >>>>>          [[alternative HTML version deleted]]
>     >>>>>
>     >>>>> _______________________________________________
>     >>>>> Bioc-devel at r-project.org mailing list
>     >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>     >>>>>