[Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)

Michael Lawrence lawrence.michael at gene.com
Wed Jun 29 06:03:58 CEST 2016

Yes, I am generally in favor of the GViews, and I've been a fan of
Views since they were invented. Was just pointing out that we can do a
lot with Lists these days.

On Tue, Jun 28, 2016 at 4:56 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> On 06/28/2016 05:43 AM, Michael Lawrence wrote:
>> Thanks for these use cases.
>> I've been wondering about the usefulness of Views given how far along
>> Lists have come since the invention of Views. Instead of
>> viewSums(Views(cov, gr)) we could just do sum(cov[gr]). The latter
>> works today. The difference between the Views and List approach is
>> that the Views data structure defers the extraction until
>> summarization.
> I think this is an important feature.
>> So we can always retrieve the entire underlying vector,
>> and the ranges of interest. But for summarize-and-forget use cases,
>> Lists should work fine.
> They work, but they're not super convenient. The user needs to know
> how to import data for his/her regions of interest only. The way to
> do this can vary across the type of file (e.g. 2bit vs BigWig).
> Having GViews objects hides these details and provides a unified
> mode of operating on a set of genomic regions of interest.
> Also, unlike a List, a GViews object bundles together the genomic
> regions and the data associated with them. This makes the object
> self-descriptive and thus reduces the risk of errors.
>> I like the idea of pushing the aggregation down to the data. BigWig
>> files are particularly well suited for this, because they have
>> precomputed summary statistics. See summary,BigWigFile. It would take
>> some hacking of the Kent library to expose everything we want, like
>> the sums.
> This sounds like an argument in favor of using GViews objects to me.
> Thanks,
> H.
>> Michael
>> On Tue, Jun 28, 2016 at 3:54 AM, Johnston, Jeffrey <jjj at stowers.org>
>> wrote:
>>> During the BioC developer day, Hervé brought up the idea of possibly
>>> extending the concept of GenomicViews to other use cases. I'd like to share
>>> how we currently use GenomicRanges, Views and RleLists to perform certain
>>> analyses, and hopefully demonstrate that a way to directly use Views with
>>> GRanges would be quite useful.
>>> As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and
>>> a set of genomic ranges we are interested in (as a BED file). We want to
>>> find the sum of the ChIP-seq signal found in our regions of interest for
>>> each of the two samples. We'd first import the BED file as a GRanges object.
>>> Annotating the GRanges with two metadata columns representing the ChIP-seq
>>> signal for the two samples seems like a straightforward use for Views (in
>>> particular, viewSums), but it is a bit convoluted.
>>> The main problem is that Views work with RangesLists, not GRanges.
>>> Coercing a GRanges to a RangesList possibly disrupts the order, so we have
>>> to first reorder the GRanges to match the order it will be given after
>>> coercion (keeping track so we can later revert back to the original order):
>>> regions <- import("regions_of_interest.bed")
>>> sample1_cov <- import("sample1.bw", as="RleList")
>>> sample2_cov <- import("sample2.bw", as="RleList")
>>> oo <- order(regions)
>>> regions <- regions[oo]
>>> Now we can construct a View and use viewSums to obtain our values
>>> (unlisting them as they are grouped by seqnames) and add them as a metadata
>>> column in our GRanges, restoring the original order of the GRanges when we
>>> are done:
>>> v <- Views(sample1_cov, as(regions, "RangesList"))
>>> mcols(regions)$sample1_signal <- unlist(viewSums(v))
>>> regions[oo] <- regions
>>> We then repeat the process to add another metadata column for sample2.
>>> To me, having the ability to use a GRanges as a view into an RleList
>>> makes a lot more sense. That would allow us to reduce all the above
>>> complexity to something like:
>>> regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
>>> regions$sample2_signal <- viewSums(Views(sample2_cov, regions))
>>> That alone would be great! But, there's a way to make it even better.
>>> Storing these RleLists in memory for each of our samples is quite
>>> inefficient, especially since our regions of interest are only a small
>>> portion of them. The rtracklayer package already has some very useful
>>> functionality for instantiating an RleList with only the data from specific
>>> ranges of a BigWig file. Taking advantage of that, we can dramatically
>>> reduce our memory usage and increase our performance like so:
>>> regions <- import("regions_of_interest.bed")
>>> sample1_cov <- import("sample1.bw", as="RleList", which=regions)
>>> sample2_cov <- import("sample2.bw", as="RleList", which=regions)
>>> regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
>>> regions$sample2_signal <- viewSums(Views(sample2_cov, regions))
>>> But can't this functionality be included in Views? Why not have it accept
>>> a BigWig file in place of an RleList and have it selectively load the
>>> portions of the BigWig it needs based on the provided GRanges? That would
>>> allow this:
>>> regions <- import("regions_of_interest.bed")
>>> regions$sample1_signal <- viewSums(Views("sample1.bw", regions))
>>> regions$sample2_signal <- viewSums(Views("sample2.bw", regions))
>>> The above is quite close to how I use GRanges and BigWigs now, except I
>>> have to write and maintain all the hackish code to link BigWig files,
>>> GRanges, Views, RangesLists and RleLists together into something that
>>> behaves as one would intuitively expect.
>>> I’d welcome any thoughts on how people perform similar analyses that
>>> involve GRanges and data stored in BigWig files or RleLists, and whether
>>> this would also be useful to them.
>>> Thanks,
>>> Jeff Johnston
>>> Zeitlinger Lab
>>> Stowers Institute
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> --
> Hervé Pagès
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

More information about the Bioc-devel mailing list