[Bioc-devel] SummarizedExperiment with alternate back end

Morgan, Martin Martin.Morgan at roswellpark.org
Sat Sep 19 12:11:24 CEST 2015


Two less-baked ideas are

    https://github.com/PaulPyl/h5array

which could be used in the assay() of SummarizedExperiment, and

    https://github.com/nhayden/h5robj

which translates R objects to HDF5.

Martin

> -----Original Message-----
> From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of
> Michael Lawrence
> Sent: Friday, September 18, 2015 10:42 PM
> To: Peter Haverty
> Cc: Tim Triche, Jr.; bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] SummarizedExperiment with alternate back end
> 
> While it's useful (and often necessary) to store the big matrices out of core, it
> would be convenient to store the metadata (the other components of the
> object) along with the matrices. Something along the lines of HDF5, but we
> would want to keep things abstract. Other options include GDS (for
> genotypes), and of course most any database.
> 
> On Fri, Sep 18, 2015 at 6:18 PM, Peter Haverty <haverty.peter at gene.com>
> wrote:
> 
> > While we are on the topic, my GenoSet class will become a subclass of
> > RangedSummarizedExperiment, rather than eSet, after this upcoming
> > release. For this release both APIs work (colnames and sampleNames,
> > etc.)
> >
> > I think the range-free SummarizedExperiment will be great. I've seen a
> > lot of ExpressionSets with random, non-exprs stuff in the exprs slot
> > for lack of something more appropriate.
> >
> > Pete
> >
> > ____________________
> > Peter M. Haverty, Ph.D.
> > Genentech, Inc.
> > phaverty at gene.com
> >
> > On Fri, Sep 18, 2015 at 6:09 PM, Ryan <rct at thompsonclan.org> wrote:
> >
> > > In the dev version, SummarizedExperiment has been split into
> > > RangedSummarizedExperiment (equivalent to the current
> > > SummarizedExperiment, with rowRanges) and SummarizedExperiment
> > > (kind of like eSet, no rowRanges). Given that eSet objects also
> > > support multiple assayData elements, I believe the new
> > > SummarizedExperiment is pretty close to being eSet with different
> > > method names. In fact, I wonder if eSet could/should be
> > > reimplemented as a subclass of the new SummarizedExperiment class.
> > >
> > >
> > > On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
> > >
> > >> Interesting, thanks for the pointer.
> > >>
> > >> In light of the existing (and future) work on this, may I suggest
> > >> an eSet-like class, but built using the technologies in
> > >> SummarizedExperiment, i.e. a SummarizedExperiment without the
> > >> rowRanges.  I would very much like this for modern work using
> > >> eSet-like containers.  Not everything has ranges.
> > >>
> > >> Vince: I am not claiming that it is easy to work with; we have
> > >> pains as well.  But am I missing something or is the assay matrix
> > >> only 2.3Gb?
> > >>
> > >> Best,
> > >> Kasper
> > >>
> > >> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty
> > >> <haverty.peter at gene.com>
> > >> wrote:
> > >>
> > >>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are
> > >>> good tricks for reducing the size of your eSets and
> > >>> SummarizedExperiments. Both object types can go into assayData or
> > >>> assays. In fact, that's what they were designed for.
> > >>>
> > >>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data
> > >>> from Illumina SNP arrays.  We typically have ~6 such rectangular
> > >>> objects in one eSet.  With a mix of BigMatrix objects for point
> > >>> estimates and RleDataFrames for segmented data, readRDS times are
> > >>> quite reasonable.
> > >>>
> > >>>
> > >>> Pete
> > >>>
> > >>> ____________________
> > >>> Peter M. Haverty, Ph.D.
> > >>> Genentech, Inc.
> > >>> phaverty at gene.com
> > >>>
> > >>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr.
> > >>> <tim.triche at gmail.com> wrote:
> > >>>
> > >>>> bigmemoryExtras (Peter Haverty's extensions to
> > >>>> bigmemory/big.matrix) can be handy for this, as it works well as a
> > >>>> backend, especially if you go about splitting by chromosome as for
> > >>>> CNV segmentation, DMR finding, etc.  It's not as seamless as one
> > >>>> might like, but it's the closest thing I've found.
> > >>>>
> > >>>> SciDB tries to implement a similar API, but for a distributed
> > >>>> version of this where the data itself is in a columnar database and
> > >>>> served on demand.  I tried getting that up and running as a
> > >>>> SummarizedExperiment backend, but did not succeed.  I have
> > >>>> previously shoveled all of the TCGA 450k data into one 7,000+
> > >>>> column bigMatrix which serializes to about 14GB on disk.
> > >>>>
> > >>>> If you have any replicates in your 700+ samples, it's a good idea
> > >>>> to keep their SNP calls in metadata(yourSE), although if you change
> > >>>> names it needs to propagate into the dependent metadata.  This is
> > >>>> why I started monkeying around with linkedExperiments where those
> > >>>> mappings are enforced; it's becoming more of an issue with the
> > >>>> TARGET pediatric AML study, where there are numerous
> > >>>> diagnosis-remission-relapse trios whose identity I wish to verify
> > >>>> periodically.  The SNPs on the 450k array are great for this
> > >>>> purpose, but minfi doesn't really have a slot for them per se, so
> > >>>> they live in metadata().
> > >>>>
> > >>>> --t
> > >>>>
> > >>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey
> > >>>> <stvjc at channing.harvard.edu> wrote:
> > >>>>
> > >>>>> i am dealing with ~700 450k arrays
> > >>>>>
> > >>>>> they are derived from one study, so it makes sense to think of
> > >>>>> them holistically.
> > >>>>>
> > >>>>> both the load time and the memory consumption are not satisfactory.
> > >>>>>
> > >>>>> has anyone worked on an object type that implements the rangedSE
> > >>>>> API but has the assay data out of memory?
> > >>>>>
> > >>>>> > unix.time(load("wbmse.rda"))
> > >>>>>    user  system elapsed
> > >>>>>  30.131   2.396  61.036
> > >>>>> > object.size(wbmse)
> > >>>>> 124031032 bytes
> > >>>>> > dim(wbmse)
> > >>>>> [1] 485577    690
> > >>>>> > object.size(assays(wbmse))
> > >>>>> 2680430992 bytes
> > >>>>>
> > >>>>>          [[alternative HTML version deleted]]
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> Bioc-devel at r-project.org mailing list
> > >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel



