[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?
Martin Morgan
mtmorgan at fhcrc.org
Fri Sep 21 15:50:36 CEST 2012
On 09/20/2012 06:47 PM, Michael Lawrence wrote:
> Thanks Vince,
>
> I think we're on the same page. I agree that a set of ranges-of-interest
> are not always appropriate, and that the most basic structure would be a
> table of samples X assays, with missing values. Ranges-of-interest can be
> layered on top when desired. There are many aspects of SummarizedExperiment
> that I would want to carry over, especially the idea of metadata, on the
> samples, assays, and features (when applicable).
>
> Michael
>
> On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>
>> I'll comment briefly because I think this is a strategically important
>> topic and
>> I have done a little bit on integration in various forms.
>>
>> My view of SummarizedExperiment is that it updates the eSet concept to
>> promote range-based indexing of assay features. The 'assays' component
>> is limited to matrix/array like things and my sense is that the
It might help to nail down a more precise 'API' for what can be in the
assays slot, but I think it would definitely be array-like; there is no
need for it to be an actual 'matrix', though.
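A rough sketch of what such an 'API' might amount to, in base R and for
the two-dimensional case only: the object reports a dim() and supports
two-dimensional subsetting with `[`. The name isArrayLike is hypothetical
and is not an existing function in any package.

```r
## Hypothetical helper, not part of any package: a minimal "array-like"
## contract that an element of the assays slot might have to satisfy.
isArrayLike <- function(x) {
    d <- dim(x)
    ## must report exactly two dimensions (rows x columns)
    if (is.null(d) || length(d) != 2L)
        return(FALSE)
    ## two-dimensional subsetting must work and keep the 2-d layout;
    ## a zero-extent subset avoids out-of-bounds errors on empty objects
    empty <- tryCatch(x[0L, 0L, drop = FALSE], error = function(e) NULL)
    !is.null(empty) && length(dim(empty)) == 2L && all(dim(empty) == 0L)
}

isArrayLike(matrix(1:6, nrow = 2))           # an actual matrix qualifies
isArrayLike(data.frame(a = 1:2, b = 3:4))    # so does a non-matrix object
isArrayLike(1:6)                             # a plain vector does not
```

A real contract would presumably also cover dimnames() and arrays with
more than two dimensions; this is only meant to show the flavor.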
>> "Summarized"
>> implies that the intention is for a memory-tractable, serializable
>> reduction of
>> an experiment applied to all of a fixed set of samples.
>>
>> I felt that what Michael was describing departs significantly from
>> these conditions/aims
>> in various ways -- there are multiple assays, possibly at different
>> stages of summarization, and one
>> wants a coherent path to interaction with these, requiring less uniformity
>> of
>> structure. Entities to be covered are, roughly, a set of biological
A major task I think would be management of on-disk resources,
guaranteeing in some way that the object is not tied to some fragile
local disk structure.
The heterogeneity of data types also seems like a significant departure.
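On the on-disk point, one possible approach (an assumption on my part,
not an existing design) is to record a fingerprint of each file when the
object is created and to re-check it before the file is used. The helper
names fingerprint and stillValid are hypothetical; tools::md5sum is in
base R.

```r
## Hypothetical helpers: record a checksum for each file when the object
## is built, and compare against it before the file is read again.
fingerprint <- function(paths) {
    data.frame(path = paths,
               md5 = unname(tools::md5sum(paths)),  # NA if unreadable
               stringsAsFactors = FALSE)
}

stillValid <- function(fp) {
    now <- unname(tools::md5sum(fp$path))
    ## FALSE when the file is gone, was never readable, or has changed
    !is.na(fp$md5) & !is.na(now) & now == fp$md5
}

tmp <- tempfile()
writeLines("chr1\t100\tA\tT", tmp)
fp <- fingerprint(tmp)
stillValid(fp)             # file unchanged since the fingerprint
writeLines("chr1\t100\tA\tG", tmp)
stillValid(fp)             # content changed underneath the object
```

Checksumming multi-gigabyte BAM files is slow, so file size and
modification time (from file.info) might be a cheaper if weaker check.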
>> samples, mostly assayed in the same ways, but the assays do not imply a
>> common
>> set of measurements on a fixed set of ranges.
>>
>> One possible term for the data structure described by Michael is
>> "ExperimentHub". This
A nice term.
Martin
>> would include references to various external data resources and it
>> would have methods
>> for traversing the resources for certain objectives. Instead of
>> nesting the SummarizedExperiment
>> structures, we could think of certain traversals culminating in
>> SummarizedExperiment instances.
>>
>> I think this would lead to high-level workflow prescriptions that
>> could be broadly applicable --
>> say you have VCFs and BAMs on a collection of samples with some gaps,
>> start with an ExperimentHub
>> consisting of path specifications and on this you could derive some
>> basic statistics on data availability. You'd want to have a little
>> more detail on the biology from which the files arose early on, to
>> help organize the
>> high-level description. For example, I assume you might have separate
>> VCFs on germ-line and tumor DNA, BAM from RNA-seq applied to different
>> cell types, and from some ChIP-seq ... some samples have all, some
>> have only a few of these assays, and spelling all this out at an early
>> stage would be very useful.
>>
>> On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>>
>>> Dear all,
>>>
>>> Here is a problem that has been bouncing around in my head, and I
>>> thought it might be time for some discussion. Maybe others have
>>> already figured this out.
>>>
>>> We are often interested in the same genomic regions over multiple
>>> datasets and multiple samples. Typically, the data are the output of
>>> a large analysis pipeline. On the surface, SummarizedExperiment
>>> is very close to the right data structure, but there are some issues.
>>>
>>> Often, these data will be too large to load completely into memory,
>>> so we need objects that point to out-of-memory storage. This would
>>> need to be matrix-like, like a BamViews object, but there would be
>>> redundancy between the ranges in the BamViews and the ranges in
>>> the rowData. Thus, the BamViews could be created from a
>>> BamFileList dynamically when the user retrieves an assay, or there
>>> would need to be consistency checking to make sure the same ranges
>>> are being described (would be a performance drain).
>>>
>>> Another issue is that certain samples may only be included in
>>> certain assays. In the simple matrix case, we could handle this with
>>> NA values. The out-of-memory references will need to support a
>>> similar semantic. So far, we have not allowed NA in the List
>>> classes, but I think we might have to move in that direction. In some
>>> ways, we are stretching the definition of SE here, because we might
>>> have multiple experiments, not just one.
>>>
>>> Perhaps we are no longer talking about a summary but are focusing
>>> more on integration, i.e., we are talking about an
>>> IntegratedExperiment. But I think SummarizedExperiment could be
>>> coerced into this role.
>>>
>>> Let's get this started with a use case; here is one related to variant
>>> calling:
>>>
>>> Assume we have some output from a sequence analysis pipeline,
>>> including alignments, coverage and variant calls. We want to
>>> validate exome variants in RNA, but only where genes are expressed
>>> (high coverage in RNA). Now assume that a SE has been constructed
>>> for the exome variant positions and all of the samples. The assays
>>> are the exome calls (VCF), the RNA calls (VCF), and the RNA
>>> coverage (BigWig). The algorithm needs to extract the variant
>>> information as GRanges, and the coverage information as an Rle.
>>>
>>> First, we extract the exome variants:
>>>
>>> > exome.variants <- assay(se, "exome.variants")
>>>
>>> What would exome.variants be? In oncology at least, it is way more
>>> efficient to output a VCF per sample and then merge them at
>>> analysis time. Let us assume that there is one VCF file per sample
>>> and internally there is a VcfFileList (I think Vince has shown
>>> something like this). The exome.variants object needs to carry
>>> along the positions from the SE rowData. The minimum conversion
>>> would be to something like a VcfViews object (as in BamViews). The
>>> VcfViews object should try to provide the same API as VCF, where
>>> it makes sense. There are obvious issues like, would the column
>>> indexing be by sample or by file? Conceptually at least, the
>>> VcfViews is going to be very similar to a merge of multiple VCF
>>> files into a VCF object. Would the return value really be a
>>> VcfViews, or could it be coerced directly to a VCF? The coercion may
>>> be complicated, so it may be best to leave that as a second step,
>>> after pulling out the assay.
>>>
>>> Alternatively, if there is a single VCF file, the data could be
>>> stored as a VCF, since it is matrix-like, after all. So SE's could
>>> be nested. This would obviously be most efficient space-wise if the
>>> VCF class were implemented on top of a tabix-indexed VCF, with
>>> on-demand materialization. But maybe it is simpler to just use a
>>> length-one VcfFileList/VcfViews for this? (As an aside, it would be
>>> nice if there were some general abstraction for variant data,
>>> whether stored in VCF, GVF, or some other format/database).
>>>
>>> Then for the coverage:
>>>
>>> > rna.coverage <- assay(se, "rna.coverage")
>>>
>>> Following the conventions above, rna.coverage would be a
>>> BigWigViews, which might have an API like viewSums, viewMaxs, etc.,
>>> for getting back a matrix of coverage summaries, possibly as a
>>> SummarizedExperiment?
>>>
>>> So that's all I have for now.
>>>
>>> Thanks,
>>> Michael
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
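To make the idea of a hub of path specifications a bit more concrete,
here is a minimal sketch in base R: a samples x assays table of file
paths, with NA where a sample was not assayed, from which basic
availability statistics follow directly. The sample names and file paths
are made up for illustration, and the assay names just follow the
use case in the thread.

```r
## Made-up sample names and file paths, for illustration only.
samples <- c("s1", "s2", "s3")
assays <- c("exome.variants", "rna.variants", "rna.coverage")

## samples x assays table of paths; NA marks "not assayed"
hub <- matrix(NA_character_, nrow = length(samples), ncol = length(assays),
              dimnames = list(samples, assays))
hub["s1", ] <- c("s1.exome.vcf.gz", "s1.rna.vcf.gz", "s1.rna.bw")
hub["s2", c("exome.variants", "rna.coverage")] <-
    c("s2.exome.vcf.gz", "s2.rna.bw")
hub["s3", "exome.variants"] <- "s3.exome.vcf.gz"

available <- !is.na(hub)
perAssay <- colSums(available)    # number of samples with each assay
perSample <- rowSums(available)   # number of assays for each sample

## samples with every assay, i.e., candidates for a conventional
## (rectangular) SummarizedExperiment
complete <- rownames(hub)[perSample == ncol(hub)]
```

Sample- and assay-level metadata could then be kept as data frames keyed
by the row and column names of this table.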
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793