[Bioc-devel] SummarizedExperiment with alternate back end

Michael Lawrence lawrence.michael at gene.com
Sat Sep 19 04:41:30 CEST 2015


While it's useful (and often necessary) to store the big matrices out of
core, it would be convenient to store the metadata (the other components of
the object) along with the matrices. Something along the lines of HDF5, but
we would want to keep things abstract. Other options include GDS (for
genotypes), and of course most any database.

On Fri, Sep 18, 2015 at 6:18 PM, Peter Haverty <haverty.peter at gene.com>
wrote:

> While we are on the topic, my GenoSet class will become a subclass of
> RangedSummarizedExperiment, rather than eSet, after this upcoming release.
> For this release both APIs work (colnames and sampleNames, etc.)
>
> I think the range-free SummarizedExperiment will be great. I've seen a lot
> of ExpressionSets with random, non-exprs stuff in the exprs slot for lack
> of something more appropriate.
>
> Pete
>
> ____________________
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phaverty at gene.com
>
> On Fri, Sep 18, 2015 at 6:09 PM, Ryan <rct at thompsonclan.org> wrote:
>
> > In the dev version, SummarizedExperiment has been split into
> > RangedSummarizedExperiment (equivalent to the current
> > SummarizedExperiment, with rowRanges) and SummarizedExperiment (kind of
> > like eSet, no rowRanges). Given that eSet objects also support multiple
> > assayData elements, I believe the new SummarizedExperiment is pretty
> > close to being eSet with different method names. In fact, I wonder if
> > eSet could/should be reimplemented as a subclass of the new
> > SummarizedExperiment class.
> >
> >
> > On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
> >
> >> Interesting, thanks for the pointer.
> >>
> >> In light of the existing (and future) work on this, may I suggest an
> >> eSet-like class, but built using the technologies in
> >> SummarizedExperiment, i.e. a SummarizedExperiment without the rowRanges.
> >> I would very much like this for modern work using eSet-like containers.
> >> Not everything has ranges.
> >>
> >> Vince: I am not claiming that it is easy to work with; we have pains as
> >> well.  But am I missing something or is the assay matrix only 2.3Gb?
> >>
> >> Best,
> >> Kasper
> >>
> >> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty <haverty.peter at gene.com>
> >> wrote:
> >>
> >>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are good
> >>> tricks for reducing the size of your eSets and SummarizedExperiments.
> >>> Both object types can go into assayData or assays. In fact, that's what
> >>> they were designed for.
> >>>
> >>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data from
> >>> Illumina SNP arrays.  We typically have ~6 such rectangular objects in
> >>> one eSet.  With a mix of BigMatrix objects for point estimates and
> >>> RleDataFrames for segmented data, readRDS times are quite reasonable.
> >>>
> >>>
> >>> Pete
> >>>
> >>> ____________________
> >>> Peter M. Haverty, Ph.D.
> >>> Genentech, Inc.
> >>> phaverty at gene.com
> >>>
> >>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> >>> wrote:
> >>>
> >>>> bigmemoryExtras (Peter Haverty's extensions to bigMemory/bigMatrix) can
> >>>> be handy for this, as it works well as a backend, especially if you go
> >>>> about splitting by chromosome as for CNV segmentation, DMR finding, etc.
> >>>> It's not as seamless as one might like, but it's the closest thing I've
> >>>> found.
> >>>>
> >>>> SciDb tries to implement a similar API, but for a distributed version of
> >>>> this where the data itself is in a columnar database and served on
> >>>> demand. I tried getting that up and running as a SummarizedExperiment
> >>>> backend, but did not succeed.  I have previously shoveled all of the TCGA
> >>>> 450k data into one 7,000+ column bigMatrix which serializes to about 14GB
> >>>> on disk.
> >>>>
> >>>> If you have any replicates in your 700+ samples, it's a good idea to
> >>>> keep their SNP calls in metadata(yourSE), although if you change names
> >>>> the change needs to propagate into the dependent metadata.  This is why
> >>>> I started monkeying around with linkedExperiments where those mappings
> >>>> are enforced; it's becoming more of an issue with the TARGET pediatric
> >>>> AML study, where there are numerous diagnosis-remission-relapse trios
> >>>> whose identity I wish to verify periodically.  The SNPs on the 450k array
> >>>> are great for this purpose, but minfi doesn't really have a slot for them
> >>>> per se, so they live in metadata().
> >>>>
> >>>>
> >>>> --t
> >>>>
> >>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey <stvjc at channing.harvard.edu>
> >>>> wrote:
> >>>>
> >>>>> i am dealing with ~700 450k arrays
> >>>>>
> >>>>> they are derived from one study, so it makes sense to think of
> >>>>> them holistically.
> >>>>>
> >>>>> both the load time and the memory consumption are not satisfactory.
> >>>>>
> >>>>> has anyone worked on an object type that implements the rangedSE API
> >>>>> but has the assay data out of memory?
> >>>>>
> >>>>> > unix.time(load("wbmse.rda"))
> >>>>>    user  system elapsed
> >>>>>  30.131   2.396  61.036
> >>>>>
> >>>>> > object.size(wbmse)
> >>>>> 124031032 bytes
> >>>>>
> >>>>> > dim(wbmse)
> >>>>> [1] 485577    690
> >>>>>
> >>>>> > object.size(assays(wbmse))
> >>>>> 2680430992 bytes
> >>>>>
> >>>>>          [[alternative HTML version deleted]]
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioc-devel at r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>
> >>
> >>
> >>
> >
> >
>


