[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?

Martin Morgan mtmorgan at fhcrc.org
Fri Sep 21 15:50:36 CEST 2012

On 09/20/2012 06:47 PM, Michael Lawrence wrote:
> Thanks Vince,
> I think we're on the same page. I agree that a set of ranges-of-interest
> is not always appropriate, and that the most basic structure would be a
> table of samples X assays, with missing values. Ranges-of-interest can be
> layered on top when desired. There are many aspects of SummarizedExperiment
> that I would want to carry over, especially the idea of metadata, on the
> samples, assays, and features (when applicable).
> Michael
> On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>> I'll comment briefly because I think this is a strategically important
>> topic and I have done a little bit on integration in various forms.
>> My view of SummarizedExperiment is that it updates the eSet concept to
>> promote range-based indexing of assay features.  The 'assays' component
>> is limited to matrix/array like things and my sense is that the

It might help to nail down a more precise 'API' for what can be in the 
assays slot, but I think it would definitely be array-like; no need for 
it to be an actual 'matrix', though.
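
As a rough illustration of what such an 'API' might require, here is a
minimal base-R sketch. It assumes that "array-like" means only "has a
dim() of length two or more", and that all assays agree in their first two
dimensions. The helper names is_array_like() and check_assays() are
hypothetical and are not part of SummarizedExperiment.

```r
# Hypothetical helper: TRUE when `x` reports at least two dimensions.
is_array_like <- function(x) {
  d <- dim(x)
  !is.null(d) && length(d) >= 2L
}

# Hypothetical validity check for a list of candidate assays: every element
# must be array-like and share the same first two dimensions
# (features x samples).
check_assays <- function(assays) {
  stopifnot(is.list(assays), length(assays) > 0L)
  ok <- vapply(assays, is_array_like, logical(1))
  if (!all(ok)) {
    stop("not array-like: ", paste(names(assays)[!ok], collapse = ", "))
  }
  # One column per assay, holding its (nrow, ncol).
  shapes <- vapply(assays, function(x) dim(x)[1:2], integer(2))
  if (any(shapes != shapes[, 1L])) {
    stop("assays differ in their first two dimensions")
  }
  invisible(TRUE)
}

# A matrix and a data.frame both pass; neither needs to be a 'matrix'.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(paste0("r", 1:3), c("s1", "s2")))
scores <- data.frame(s1 = c(0.1, 0.2, 0.3), s2 = c(0.4, 0.5, 0.6))
check_assays(list(counts = counts, scores = scores))
```

A real contract would presumably also pin down two-dimensional
subsetting and dimnames; those are left out here to keep the sketch short.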

>> "Summarized" implies that the intention is for a memory-tractable,
>> serializable reduction of an experiment applied to all of a fixed set
>> of samples.
>> I felt that what Michael was describing departs significantly from
>> these conditions/aims in various ways -- there are multiple assays,
>> possibly at different stages of summarization, and one wants a coherent
>> path to interaction with these, requiring less uniformity of
>> structure.  Entities to be covered are, roughly, a set of biological

A major task I think would be management of on-disk resources, 
guaranteeing in some way that the object is not tied to some fragile 
local disk structure.
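
One way to reduce that fragility, sketched here in base R under the
assumption that all files live beneath a single movable root directory, is
to store only relative paths and resolve them at access time. The names
make_file_refs(), resolve_refs() and missing_refs() are hypothetical, as
are the file names.

```r
# Hypothetical structure: file references kept relative to a movable root.
make_file_refs <- function(root, rel_paths) {
  stopifnot(is.character(root), length(root) == 1L, is.character(rel_paths))
  structure(list(root = root, rel_paths = rel_paths), class = "FileRefs")
}

# Resolve relative paths against the current root.
resolve_refs <- function(refs) {
  file.path(refs$root, refs$rel_paths)
}

# Report the relative paths whose files are not present on disk.
missing_refs <- function(refs) {
  refs$rel_paths[!file.exists(resolve_refs(refs))]
}

# Usage: only sample1.bam is created, so sample2.bam is reported missing.
root <- file.path(tempdir(), "experiment")
dir.create(root, showWarnings = FALSE)
file.create(file.path(root, "sample1.bam"))
refs <- make_file_refs(root, c("sample1.bam", "sample2.bam"))
missing_refs(refs)
```

If the directory is moved, only refs$root needs updating; the check can
then be re-run before any assay is materialized.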

The heterogeneity of data types also seems like a significant departure.

>> samples, mostly assayed in the same ways, but the assays do not imply a
>> common set of measurements on a fixed set of ranges.
>> One possible term for the data structure described by Michael is
>> "ExperimentHub".  This

A nice term.


>> would include references to various external data resources and it
>> would have methods
>> for traversing the resources for certain objectives.  Instead of
>> nesting the SummarizedExperiment
>> structures, we could think of certain traversals culminating in
>> SummarizedExperiment instances.
>> I think this would lead to high-level workflow prescriptions that
>> could be broadly applicable --
>> say you have VCFs and BAMs on a collection of samples with some gaps,
>> start with an ExperimentHub
>> consisting of path specifications and on this you could derive some
>> basic statistics on data availability.  You'd want to have a little
>> more detail on the biology from which the files arose early on, to
>> help organize the
>> high-level description.  For example, I assume you might have separate
>> VCFs on germ-line and tumor DNA, BAM from RNA-seq applied to different
>> cell types, and from some ChIP-seq ... some samples have all, some
>> have only a few of these assays, and spelling all this out at an early
>> stage would be very useful.
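
The "basic statistics on data availability" step could be sketched in base
R roughly as follows, assuming the hub starts life as a plain samples x
assays table of file paths with NA for assays that were not run. The
sample names, assay names and file names are all made up for illustration.

```r
# Hypothetical path table: one row per sample, one column per assay type;
# NA marks an assay that was not run for that sample.
paths <- data.frame(
  germline_vcf = c("s1.germ.vcf", "s2.germ.vcf", NA),
  tumor_vcf    = c("s1.tumor.vcf", NA, NA),
  rnaseq_bam   = c("s1.rna.bam", "s2.rna.bam", "s3.rna.bam"),
  row.names    = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)

# Logical samples x assays matrix: TRUE where a file is recorded.
available <- !is.na(paths)

assays_per_sample <- rowSums(available)   # how many assays each sample has
samples_per_assay <- colSums(available)   # how many samples each assay covers

# Samples with every assay present.
complete_samples <- rownames(paths)[assays_per_sample == ncol(paths)]
```

Here only s1 has all three assays, which is the sort of gap one would want
to see spelled out before deriving any SummarizedExperiment from the hub.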
>> On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>> Dear all,
>>> Here is a problem that has been bouncing around in my head, and I
>>> thought it might be time for some discussion. Maybe others have
>>> already figured this out.
>>>    We are often interested in the same genomic regions over multiple
>>>    datasets and multiple samples. Typically, the data are the output of
>>>    a large analysis pipeline. On the surface, SummarizedExperiment
>>>    is very close to the right data structure, but there are some issues.
>>>    Often, these data will be too large to load completely into memory,
>>>    so we need objects that point to out-of-memory storage. This would
>>>    need to be matrix-like, like a BamViews object, but there would be
>>>    redundancy between the ranges in the BamViews and the ranges in
>>>    the rowData. Thus, the BamViews could be created from a
>>>    BamFileList dynamically when the user retrieves an assay, or there
>>>    would need to be consistency checking to make sure the same ranges
>>>    are being described (would be a performance drain).
>>>    Another issue is that certain samples may only be included in
>>>    certain assays. In the simple matrix case, we could handle this with
>>>    NA values. The out-of-memory references will need to support a
>>>    similar semantic. So far, we have not allowed NA in the List
>>>    classes, but I think we might have to move in that direction. In some
>>>    ways, we are stretching the definition of SE here, because we might
>>>    have multiple experiments, not just one.
>>>    Perhaps we are no longer talking about a summary but are focusing
>>>    more on integration, i.e., we are talking about an
>>>    IntegratedExperiment. But I think SummarizedExperiment could be
>>>    coerced into this role.
>>> Let's get this started with a use case; here is one related to variant
>>> calling:
>>>     Assume we have some output from a sequence analysis pipeline,
>>>     including alignments, coverage and variant calls. We want to
>>>     validate exome variants in RNA, but only where genes are expressed
>>>     (high coverage in RNA). Now assume that a SE has been constructed
>>>     for the exome variant positions and all of the samples. The assays
>>>     are the exome calls (VCF), the RNA calls (VCF), and the RNA
>>>     coverage (BigWig). The algorithm needs to extract the variant
>>>     information as GRanges, and the coverage information as an Rle.
>>>     First, we extract the exome variants:
>>>     > exome.variants <- assay(se, "exome.variants")
>>>     What would exome.variants be? In oncology at least, it is way more
>>>     efficient to output a VCF per sample and then merge them at
>>>     analysis time. Let us assume that there is one VCF file per sample
>>>     and internally there is a VcfFileList (I think Vince has shown
>>>     something like this). The exome.variants object needs to carry
>>>     along the positions from the SE rowData. The minimum conversion
>>>     would be to something like a VcfViews object (as in BamViews). The
>>>     VcfViews object should try to provide the same API as VCF, where
>>>     it makes sense. There are obvious issues like, would the column
>>>     indexing be by sample or by file? Conceptually at least, the
>>>     VcfViews is going to be very similar to a merge of multiple VCF
>>>     files into a VCF object. Would the return value really be a
>>>     VcfViews, or could it be coerced directly to a VCF? The coercion may
>>>     be complicated, so it may be best to leave that as a second step,
>>>     after pulling out the assay.
>>>     Alternatively, if there is a single VCF file, the data could be
>>>     stored as a VCF, since it is matrix-like, after all. So SE's could
>>>     be nested. This would obviously be most efficient space-wise if the
>>>     VCF class were implemented on top of a tabix-indexed VCF, with
>>>     on-demand materialization. But maybe it is simpler to just use a
>>>     length-one VcfFileList/VcfViews for this? (As an aside, it would be
>>>     nice if there were some general abstraction for variant data,
>>>     whether stored in VCF, GVF, or some other format/database).
>>>     Then for the coverage:
>>>     > rna.coverage <- assay(se, "rna.coverage")
>>>     Following the conventions above, rna.coverage would be a
>>>     BigWigViews, which might have an API like viewSums, viewMaxs, etc.,
>>>     for getting back a matrix of coverage summaries, possibly as a
>>>     SummarizedExperiment?
>>> So that's all I have for now.
>>> Thanks,
>>> Michael
>>>          [[alternative HTML version deleted]]
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
