[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?

Thu Sep 20 18:30:44 CEST 2012

I'll comment briefly because I think this is a strategically important topic and
I have done a little bit on integration in various forms.

My view of SummarizedExperiment is that it updates the eSet concept to
promote range-based indexing of assay features.  The 'assays' component
is limited to matrix/array like things and my sense is that the "Summarized"
implies that the intention is for a memory-tractable, serializable reduction of
an experiment applied to all of a fixed set of samples.

I felt that what Michael was describing departs significantly from
these conditions/aims
in various ways -- there are multiple assays, possibly at different
stages of summarization, and one
wants a coherent path to interaction with these, requiring less uniformity of
structure.  Entities to be covered are, roughly, a set of biological
samples, mostly assayed in the same ways, but the assays do not imply a common
set of measurements on a fixed set of ranges.

One possible term for the data structure described by Michael is
"ExperimentHub".  This
would include references to various external data resources and it
would have methods
for traversing the resources for certain objectives.  Instead of
nesting the SummarizedExperiment
structures, we could think of certain traversals culminating in
SummarizedExperiment instances.

I think this would lead to high-level workflow prescriptions that
could be broadly applicable --
say you have VCFs and BAMs on a collection of samples with some gaps,
start with an ExperimentHub
consisting of path specifications and on this you could derive some
basic statistics on data availability.  You'd want to have a little
more detail on the biology from which the files arose early on, to
help organize the
high-level description.  For example, I assume you might have separate
VCFs on germ-line and tumor DNA, BAM from RNA-seq applied to different
cell types, and from some ChIP-seq ... some samples have all, some
have only a few of these assays, and spelling all this out at an early
stage would be very useful.

On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
> Dear all,
>
> Here is a problem that has been bouncing around in my head, and I
> thought it might be time for some discussion. Maybe others have
> already figured this out.
>
>   We are often interested in the same genomic regions over multiple
>   datasets and multiple samples. Typically, the data are the output of
>   a large analysis pipeline. On the surface, SummarizedExperiment
>   is very close to the right data structure, but there are some issues.
>
>   Often, these data will be too large to load completely into memory,
>   so we need objects that point to out-of-memory storage. This would
>   need to be matrix-like, like a BamViews object, but there would be
>   redundancy between the ranges in the BamViews and the ranges in
>   the rowData. Thus, the BamViews could be created from a
>   BamFileList dynamically when the user retrieves an assay, or there
>   would need to be consistency checking to make sure the same ranges
>   are being described (would be a performance drain).
>
>   Another issue is that certain samples may only be included in
>   certain assays. In the simple matrix case, we could handle this with
>   NA values. The out-of-memory references will need to support a
>   similar semantic. So far, we have not allowed NA in the List
>   classes, but I think we might have to move in the direction. In some
>   ways, we are stretching the definition of SE here, because we might
>   have multiple experiments, not just one.
>
>   Perhaps we are no longer talking about a summary but are focusing
>   more on integration, i.e., we are talking about an
>   IntegratedExperiment. But I think SummarizedExperiment could be
>   coerced into this role.
>
> Let's get this started with a use-case, here is one related to variant
> calling:
>
>    Assume we have some output from a sequence analysis pipeline,
>    including alignments, coverage and variant calls. We want to
>    validate exome variants in RNA, but only where genes are expressed
>    (high coverage in RNA). Now assume that a SE has been constructed
>    for the exome variant positions and all of the samples. The assays
>    are the exome calls (VCF), the RNA calls (VCF), and the RNA
>    coverage (BigWig). The algorithm needs to extract the variant
>    information as GRanges, and the coverage information as an Rle.
>
>    First, we extract the exome variants:
>
>    > exome.variants <- assay(se, "exome.variants")
>
>    What would exome.variants be? In oncology at least, it is way more
>    efficient to output a VCF per sample and then merge them at
>    analysis time. Let us assume that there is one VCF file per sample
>    and internally there is a VcfFileList (I think Vince has shown
>    something like this). The exome.variants object needs to carry
>    along the positions from the SE rowData. The minimum conversion
>    would be to something like a VcfViews object (as in BamViews). The
>    VcfViews object should try to provide the same API has VCF, where
>    it makes sense. There are obvious issues like, would the column
>    indexing be by sample or by file? Conceptually at least, the
>    VcfViews is going to be very similar to a merge of multiple VCF
>    files into a VCF object. Would the return value really be a
>    VcfViews, or could it coerced directly to a VCF? The coercion may
>    be complicated, so it may be best to leave that as a second step,
>    after pulling out the assay.
>
>    Alternatively, if there is a single VCF file, the data could be
>    stored as a VCF, since it is matrix-like, after all. So SE's could
>    be nested. This would obviously be most efficient space-wise if the
>    VCF class were implemented on top of a tabix-indexed VCF, with
>    on-demand materialization. But maybe it is simpler to just use a
>    length-one VcfFileList/VcfViews for this? (As an aside, it would be
>    nice if there were some general abstraction for variant data,
>    whether stored in VCF, GVF, or some other format/database).
>
>    Then for the coverage:
>
>    > rna.coverage <- assay(se, "rna.coverage")
>
>    Following the conventions above, rna.coverage would be a
>    BigWigViews, which might have an API like viewSums, viewMaxs, etc
>    for getting back a matrix of coverage summaries, possibly as a
>    SummarizedExperiment?
>
> So that's all I have for now.
>
> Thanks,
> Michael
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel