[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?

Fri Sep 21 16:09:59 CEST 2012

So here is my 2 cents.  Perhaps a bit rambling, but it is probably
better to weigh in now.

There are two issues here.  One has to do with the right
representation of multiple classes of experiments and one has to do
with having the data on disk instead of in memory.

The last one first.  Michael is right to note that this (having an
on-disk representation) would be highly useful.  Some class that has a
pointer to a file and perhaps a getData method which would pick out a
region and return a SummarizedExperiment would be great.  This would
need to support BAM, bigWig and bigBed at least, and allow for each
sample being in a different file.  This is very much what we tried to
do with Genominator, except that we used a special file format instead
of just being able to point to different file types.

Now for the first one.  The use case I see is where you have a number
of assays on the same individuals but also a number of (different or
the same) assays on other samples.  Let us for example say that you
have done RNA-seq on some people and you want to look at ENCODE
chip-seq data in that region.

I think of this as a _collection_ of SummarizedExperiments.  A
collection because all assays in a SummarizedExperiment need to share
the same ranges.  And if you really have different assays, you may
have copy number (where each range is likely to be long), seq.
expression and chipseq.  They all have different types of structure.
One solution to bring them all into a set of shared ranges is to
essentially do a disjoint on the ranges, but I don't like that.  I
think it will be important to store a single long copy number change
as a single range and not as a union of ranges.

I think it is important to allow different samples for different
experiments (and in fact I think this will be more common - say you
want to contrast you data with other public data in the same region -
this is unlikely to be the same samples).  And I don't think this
should be done by having a lot of NA's in the matrices.

So I think we need something like a list of SummarizedExperiments,
perhaps with a joint sampleData (how a joint sample data is mapped to
multiple assays will need to be thought about).  We might also have a
joint GRanges which signifies "this is the region(s) we have data on",
but we should still retain the individual ranges for each experiment.
Something like

dataRanges
  A GRanges, just telling up what is essentially the union of the
rowData in the SummarizedExperiment below, or perhaps bigger.
copyNumber
  SummarizedExperiment, 3 samples
  has assays "copyNumber" and perhaps "control"
TF binding
  SummarizedExperiment, 10 samples
  has assays "TF1", .., "TF5" and "input"
SampleData
  some kind of joint sample phenodata.

Kasper

On Fri, Sep 21, 2012 at 9:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/20/2012 06:47 PM, Michael Lawrence wrote:
>>
>> Thanks Vince,
>>
>> I think we're on the same page. I agree that a set of ranges-of-interest
>> are not always appropriate, and that the most basic structure would be a
>> table of samples X assays, with missing values. Ranges-of-interest can be
>> layered on top when desired. There are many aspects of
>> SummarizedExperiment
>> that I would want to carry over, especially the idea of metadata, on the
>> samples, assays, and features (when applicable).
>>
>> Michael
>>
>> On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey
>> <stvjc at channing.harvard.edu>wrote:
>>
>>> I'll comment briefly because I think this is a strategically important
>>> topic and
>>> I have done a little bit on integration in various forms.
>>>
>>> My view of SummarizedExperiment is that it updates the eSet concept to
>>> promote range-based indexing of assay features.  The 'assays' component
>>> is limited to matrix/array like things and my sense is that the
>
>
> It might help to nail down a more precise 'API' for what can be in the
> assays slot, but I think it would be definitely array-like; no need for it
> to be an actual 'matrix', though.
>
>
>>> "Summarized"
>>> implies that the intention is for a memory-tractable, serializable
>>> reduction of
>>> an experiment applied to all of a fixed set of samples.
>>>
>>> I felt that what Michael was describing departs significantly from
>>> these conditions/aims
>>> in various ways -- there are multiple assays, possibly at different
>>> stages of summarization, and one
>>> wants a coherent path to interaction with these, requiring less
>>> uniformity
>>> of
>>> structure.  Entities to be covered are, roughly, a set of biological
>
>
> A major task I think would be management of on-disk resources, guaranteeing
> in some way that the object is not tied to some fragile local disk
> structure.
>
> The heterogeneity of data types also seems like a significant departure.
>
>
>>> samples, mostly assayed in the same ways, but the assays do not imply a
>>> common
>>> set of measurements on a fixed set of ranges.
>>>
>>> One possible term for the data structure described by Michael is
>>> "ExperimentHub".  This
>
>
> a nice term.
>
> Martin
>
>
>>> would include references to various external data resources and it
>>> would have methods
>>> for traversing the resources for certain objectives.  Instead of
>>> nesting the SummarizedExperiment
>>> structures, we could think of certain traversals culminating in
>>> SummarizedExperiment instances.
>>>
>>> I think this would lead to high-level workflow prescriptions that
>>> could be broadly applicable --
>>> say you have VCFs and BAMs on a collection of samples with some gaps,
>>> start with an ExperimentHub
>>> consisting of path specifications and on this you could derive some
>>> basic statistics on data availability.  You'd want to have a little
>>> more detail on the biology from which the files arose early on, to
>>> help organize the
>>> high-level description.  For example, I assume you might have separate
>>> VCFs on germ-line and tumor DNA, BAM from RNA-seq applied to different
>>> cell types, and from some ChIP-seq ... some samples have all, some
>>> have only a few of these assays, and spelling all this out at an early
>>> stage would be very useful.
>>>
>>> On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> Here is a problem that has been bouncing around in my head, and I
>>>> thought it might be time for some discussion. Maybe others have
>>>> already figured this out.
>>>>
>>>>    We are often interested in the same genomic regions over multiple
>>>>    datasets and multiple samples. Typically, the data are the output of
>>>>    a large analysis pipeline. On the surface, SummarizedExperiment
>>>>    is very close to the right data structure, but there are some issues.
>>>>
>>>>    Often, these data will be too large to load completely into memory,
>>>>    so we need objects that point to out-of-memory storage. This would
>>>>    need to be matrix-like, like a BamViews object, but there would be
>>>>    redundancy between the ranges in the BamViews and the ranges in
>>>>    the rowData. Thus, the BamViews could be created from a
>>>>    BamFileList dynamically when the user retrieves an assay, or there
>>>>    would need to be consistency checking to make sure the same ranges
>>>>    are being described (would be a performance drain).
>>>>
>>>>    Another issue is that certain samples may only be included in
>>>>    certain assays. In the simple matrix case, we could handle this with
>>>>    NA values. The out-of-memory references will need to support a
>>>>    similar semantic. So far, we have not allowed NA in the List
>>>>    classes, but I think we might have to move in the direction. In some
>>>>    ways, we are stretching the definition of SE here, because we might
>>>>    have multiple experiments, not just one.
>>>>
>>>>    Perhaps we are no longer talking about a summary but are focusing
>>>>    more on integration, i.e., we are talking about an
>>>>    IntegratedExperiment. But I think SummarizedExperiment could be
>>>>    coerced into this role.
>>>>
>>>> Let's get this started with a use-case, here is one related to variant
>>>> calling:
>>>>
>>>>     Assume we have some output from a sequence analysis pipeline,
>>>>     including alignments, coverage and variant calls. We want to
>>>>     validate exome variants in RNA, but only where genes are expressed
>>>>     (high coverage in RNA). Now assume that a SE has been constructed
>>>>     for the exome variant positions and all of the samples. The assays
>>>>     are the exome calls (VCF), the RNA calls (VCF), and the RNA
>>>>     coverage (BigWig). The algorithm needs to extract the variant
>>>>     information as GRanges, and the coverage information as an Rle.
>>>>
>>>>     First, we extract the exome variants:
>>>>
>>>>     > exome.variants <- assay(se, "exome.variants")
>>>>
>>>>     What would exome.variants be? In oncology at least, it is way more
>>>>     efficient to output a VCF per sample and then merge them at
>>>>     analysis time. Let us assume that there is one VCF file per sample
>>>>     and internally there is a VcfFileList (I think Vince has shown
>>>>     something like this). The exome.variants object needs to carry
>>>>     along the positions from the SE rowData. The minimum conversion
>>>>     would be to something like a VcfViews object (as in BamViews). The
>>>>     VcfViews object should try to provide the same API has VCF, where
>>>>     it makes sense. There are obvious issues like, would the column
>>>>     indexing be by sample or by file? Conceptually at least, the
>>>>     VcfViews is going to be very similar to a merge of multiple VCF
>>>>     files into a VCF object. Would the return value really be a
>>>>     VcfViews, or could it coerced directly to a VCF? The coercion may
>>>>     be complicated, so it may be best to leave that as a second step,
>>>>     after pulling out the assay.
>>>>
>>>>     Alternatively, if there is a single VCF file, the data could be
>>>>     stored as a VCF, since it is matrix-like, after all. So SE's could
>>>>     be nested. This would obviously be most efficient space-wise if the
>>>>     VCF class were implemented on top of a tabix-indexed VCF, with
>>>>     on-demand materialization. But maybe it is simpler to just use a
>>>>     length-one VcfFileList/VcfViews for this? (As an aside, it would be
>>>>     nice if there were some general abstraction for variant data,
>>>>     whether stored in VCF, GVF, or some other format/database).
>>>>
>>>>     Then for the coverage:
>>>>
>>>>     > rna.coverage <- assay(se, "rna.coverage")
>>>>
>>>>     Following the conventions above, rna.coverage would be a
>>>>     BigWigViews, which might have an API like viewSums, viewMaxs, etc
>>>>     for getting back a matrix of coverage summaries, possibly as a
>>>>     SummarizedExperiment?
>>>>
>>>> So that's all I have for now.
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel