[Bioc-devel] SummarizedExperiment with alternate back end

Sat Sep 19 12:41:45 CEST 2015

A long time ago, I created this (on top of rhdf5), which I often use when I
need to use hdf5 files: https://github.com/benilton/rhdf5utils

On Sat, Sep 19, 2015, 07:12 Morgan, Martin <Martin.Morgan at roswellpark.org>
wrote:

> Two less baked ideas are
>
>     https://github.com/PaulPyl/h5array
>
> which could be used in the assay() of SummarizedExperiment, and
>
>   https://github.com/nhayden/h5robj
>
> which translates R objects to hdf5.
>
> Martin
>
> > -----Original Message-----
> > From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of
> > Michael Lawrence
> > Sent: Friday, September 18, 2015 10:42 PM
> > To: Peter Haverty
> > Cc: Tim Triche, Jr.; bioc-devel at r-project.org
> > Subject: Re: [Bioc-devel] SummarizedExperiment with alternate back end
> >
> > > While it's useful (and often necessary) to store the big matrices out
> > > of core, it would be convenient to store the metadata (the other
> > > components of the object) along with the matrices. Something along the
> > > lines of HDF5, but we would want to keep things abstract. Other options
> > > include GDS (for genotypes), and of course most any database.
> >
> > On Fri, Sep 18, 2015 at 6:18 PM, Peter Haverty <haverty.peter at gene.com>
> > wrote:
> >
> > > > While we are on the topic, my GenoSet class will become a subclass of
> > > > RangedSummarizedExperiment, rather than eSet, after this upcoming
> > > > release. For this release both APIs work (colnames and sampleNames,
> > > > etc.)
> > >
> > > I think the range-free SummarizedExperiment will be great. I've seen a
> > > lot of ExpressionSets with random, non-exprs stuff in the exprs slot
> > > for lack of something more appropriate.
> > >
> > > Pete
> > >
> > > ____________________
> > > Peter M. Haverty, Ph.D.
> > > Genentech, Inc.
> > > phaverty at gene.com
> > >
> > > On Fri, Sep 18, 2015 at 6:09 PM, Ryan <rct at thompsonclan.org> wrote:
> > >
> > > > > In the dev version, SummarizedExperiment has been split into
> > > > > RangedSummarizedExperiment (equivalent to the current
> > > > > SummarizedExperiment, with rowRanges) and SummarizedExperiment
> > > > > (kind of like eSet, no rowRanges). Given that eSet objects also
> > > > > support multiple assayData elements, I believe the new
> > > > > SummarizedExperiment is pretty close to being eSet with different
> > > > > method names. In fact, I wonder if eSet could/should be
> > > > > reimplemented as a subclass of the new SummarizedExperiment class.
> > > >
> > > >
> > > > On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
> > > >
> > > >> Interesting, thanks for the pointer.
> > > >>
> > > > >> In light of the existing (and future) work on this, may I suggest
> > > > >> an eSet-like class, but built using the technologies in
> > > > >> SummarizedExperiment, i.e., a SummarizedExperiment without the
> > > > >> rowRanges.  I would very much like this for modern work using
> > > > >> eSet-like containers.  Not everything has ranges.
> > > > >>
> > > > >> Vince: I am not claiming that it is easy to work with; we have
> > > > >> pains as well.  But am I missing something or is the assay matrix
> > > > >> only 2.3Gb?
> > > >>
> > > >> Best,
> > > >> Kasper
> > > >>
> > > > >> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty
> > > > >> <haverty.peter at gene.com> wrote:
> > > > >>
> > > > >>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are
> > > > >>> good tricks for reducing the size of your eSets and
> > > > >>> SummarizedExperiments. Both object types can go into assayData or
> > > > >>> assays. In fact, that's what they were designed for.
> > > > >>>
> > > > >>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data
> > > > >>> from Illumina SNP arrays.  We typically have ~6 such rectangular
> > > > >>> objects in one eSet.  With a mix of BigMatrix objects for point
> > > > >>> estimates and RleDataFrames for segmented data, readRDS times are
> > > > >>> quite reasonable.
> > > >>>
> > > >>>
> > > >>> Pete
> > > >>>
> > > >>> ____________________
> > > >>> Peter M. Haverty, Ph.D.
> > > >>> Genentech, Inc.
> > > >>> phaverty at gene.com
> > > >>>
> > > > >>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr.
> > > > >>> <tim.triche at gmail.com> wrote:
> > > > >>>
> > > > >>>> bigmemoryExtras (Peter Haverty's extensions to
> > > > >>>> bigMemory/bigMatrix) can be handy for this, as it works well as
> > > > >>>> a backend, especially if you go about splitting by chromosome
> > > > >>>> as for CNV segmentation, DMR finding, etc.  It's not as seamless
> > > > >>>> as one might like, but it's the closest thing I've found.
> > > > >>>>
> > > > >>>> SciDb tries to implement a similar API, but for a distributed
> > > > >>>> version of this where the data itself is in a columnar database
> > > > >>>> and served on demand.  I tried getting that up and running as a
> > > > >>>> SummarizedExperiment backend, but did not succeed.  I have
> > > > >>>> previously shoveled all of the TCGA 450k data into one 7,000+
> > > > >>>> column bigMatrix which serializes to about 14GB on disk.
> > > > >>>>
> > > > >>>> If you have any replicates in your 700+ samples, it's a good idea
> > > > >>>> to keep their SNP calls in metadata(yourSE), although if you
> > > > >>>> change names it needs to propagate into the dependent metadata.
> > > > >>>> This is why I started monkeying around with linkedExperiments
> > > > >>>> where those mappings are enforced; it's becoming more of an issue
> > > > >>>> with the TARGET pediatric AML study, where there are numerous
> > > > >>>> diagnosis-remission-relapse trios whose identity I wish to verify
> > > > >>>> periodically.  The SNPs on the 450k array are great for this
> > > > >>>> purpose, but minfi doesn't really have a slot for them per se, so
> > > > >>>> they live in metadata().
> > > >>>>
> > > >>>>
> > > >>>> --t
> > > >>>>
> > > > >>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey
> > > > >>>> <stvjc at channing.harvard.edu> wrote:
> > > > >>>>
> > > > >>>>> i am dealing with ~700 450k arrays
> > > > >>>>>
> > > > >>>>> they are derived from one study, so it makes sense to think of
> > > > >>>>> them holistically.
> > > > >>>>>
> > > > >>>>> both the load time and the memory consumption are not
> > > > >>>>> satisfactory.
> > > > >>>>>
> > > > >>>>> has anyone worked on an object type that implements the
> > > > >>>>> rangedSE API but has the assay data out of memory?
> > > > >>>>>
> > > > >>>>> > unix.time(load("wbmse.rda"))
> > > > >>>>>     user  system elapsed
> > > > >>>>>   30.131   2.396  61.036
> > > > >>>>>
> > > > >>>>> > object.size(wbmse)
> > > > >>>>> 124031032 bytes
> > > > >>>>>
> > > > >>>>> > dim(wbmse)
> > > > >>>>> [1] 485577    690
> > > > >>>>>
> > > > >>>>> > object.size(assays(wbmse))
> > > > >>>>> 2680430992 bytes
> > > >>>>>
> > > >>>>>          [[alternative HTML version deleted]]
> > > >>>>>
> > > >>>>> _______________________________________________
> > > >>>>> Bioc-devel at r-project.org mailing list
> > > >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >>>>>
> > > >>>>>          [[alternative HTML version deleted]]
> > > >>>>
> > > >>>> _______________________________________________
> > > >>>> Bioc-devel at r-project.org mailing list
> > > >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >>>>
> > > >>>>          [[alternative HTML version deleted]]
> > > >>>
> > > >>> _______________________________________________
> > > >>> Bioc-devel at r-project.org mailing list
> > > >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >>>
> > > >>>         [[alternative HTML version deleted]]
> > > >>
> > > >> _______________________________________________
> > > >> Bioc-devel at r-project.org mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioc-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>