[BioC] views on Rle using GRanges object

Tue Aug 16 19:55:51 CEST 2011

Hi,

Thank you both - I'll make sure I keep everything sorted for now (and perhaps later I can be less vigilant and the code will enforce sorting, as well as provide a method to help me do it - that would be nice).

From my more casual user's point of view, I still get confused sometimes about when it's better to use RangedData and when GRanges objects are better (I've asked that on the list before and did get a helpful answer). It seems they're getting more and more comparable as time goes by, but as you say, Herve, still a few methods are available for one, and some for the other.   I'm mostly using GRanges for now. It would be nice for me just to stick with one or the other in my code.

I'm imagining that at some point in future I'll be able to do everything using GRanges, without coercing back and forth and worrying about naming, ordering, and keeping my metadata in the same order - does that sound like what you're aiming for from a package design perspective?  I wish I could have attended your session on this at BioC2011 but I was at some other equally interesting session at the same time.

Janet

On Aug 16, 2011, at 10:39 AM, Hervé Pagès wrote:

> Hi Michael,
> 
> On 11-08-15 10:28 PM, Michael Lawrence wrote:
>> On Mon, Aug 15, 2011 at 10:10 PM, Michael Lawrence<michafla at gene.com>wrote:
>> 
>>> 
>>> 
>>> On Mon, Aug 15, 2011 at 5:25 PM, Janet Young<jayoung at fhcrc.org>  wrote:
>>> 
>>>> Hi again,
>>>> 
>>>> I have another question, in the "I think I need a convoluted workaround,
>>>> but maybe I'm missing a simple solution" genre (seems like most of my
>>>> questions are like that).
>>>> 
>>>> I have an RleList object representing mapability for the whole human
>>>> genome. I'd like to:
>>>> (a) calculate viewMeans for various regions of interest that I've been
>>>> representing as GRanges, and
>>>> (b) I'd like the underlying code to be smart and match chromosome names up
>>>> in the RleList and the GRanges object (not rely on chromosomes being ordered
>>>> the same in the two objects), and
>>>> (c) I'd like to return the viewMeans results in the same order as the
>>>> objects in my original GRanges
>>>> 
>>>> I don't think this is currently implemented without doing several
>>>> coercions (that introduce their own complications) but I'm not sure.  Some
>>>> code is below that shows what I'm trying to do.   It seems like it might be
>>>> a common kind of way to use viewMeans - I imagine people are gradually
>>>> switching to use GRanges more and more? but really I don't know.
>>>> 
>>>> My real world question is that I've read in mapability from a UCSC bigWig
>>>> file, and made that into an RleList.  I have a bunch of other regions I've
>>>> read in (again from UCSC) using rtracklayer as GRanges, and have annotated
>>>> those with various things I'm interested in (e.g. number of RNAseq reads in
>>>> the region).  I want to add the average mapability for each region, so that
>>>> I can later look at how mapability affects those other things I'm
>>>> annotating.
>>>> 
>>>> Should I be being more strict with myself about how my GRanges are ordered
>>>> and making sure they all have the same seqlevels and seqlengths?  Perhaps
>>>> that would help the coercion workarounds go more smoothly.
>>>> 
>>>> thanks,
>>>> 
>>>> Janet
>>>> 
>>>> 
>>>> library(GenomicRanges)
>>>> 
>>>> ### make an RleList. Just for this example, I starting with a GRanges
>>>> object and used coverage to get an RleList example. To get my real data I
>>>> read in a UCSC bigWig file.
>>>> 
>>>> fakeRegions<- GRanges(seqnames=c("chrA","chrA", "chrB", "chrC"),
>>>> ranges=IRanges(start=c(1,51,1,1), end=c(60,90,20,10) ) )
>>>> seqlengths(fakeRegions)<- c(100,100,100)
>>>> myRleList<- coverage(fakeRegions)
>>>> rm(fakeRegions)
>>>> 
>>>> myRleList
>>>> 
>>>> # SimpleRleList of length 3
>>>> # $chrA
>>>> # 'integer' Rle of length 100 with 4 runs
>>>> #   Lengths: 50 10 30 10
>>>> #   Values :  1  2  1  0
>>>> #
>>>> # $chrB
>>>> # 'integer' Rle of length 100 with 2 runs
>>>> #   Lengths: 20 80
>>>> #   Values :  1  0
>>>> #
>>>> # $chrC
>>>> # 'integer' Rle of length 100 with 2 runs
>>>> #   Lengths: 10 90
>>>> #   Values :  1  0
>>>> 
>>>> ### make some regions of interest
>>>> myRegions<- GRanges(seqnames=c("chrB", "chrC", "chrB"),
>>>> ranges=IRanges(start=c(1,1,5), end=c(20,20,10) ),
>>>> geneNames=c("g1","g2","g3") )
>>>> myRegions
>>>> # GRanges with 3 ranges and 1 elementMetadata value
>>>> #     seqnames    ranges strand |   geneNames
>>>> #<Rle>  <IRanges>   <Rle>  |<character>
>>>> # [1]     chrB   [1, 20]      * |          g1
>>>> # [2]     chrC   [1, 20]      * |          g2
>>>> # [3]     chrB   [5, 10]      * |          g3
>>>> #
>>>> # seqlengths
>>>> #  chrB chrC
>>>> #    NA   NA
>>>> 
>>>> ## can't use GRanges directly
>>>> Views( myRleList, myRegions)
>>>> # Error in RleViewsList(rleList = subject, rangesList = start) :
>>>> #   'rangesList' must be a RangesList object
>>>> 
>>>> ## can't use a simple coercion
>>>> Views( myRleList, as(myRegions,"RangesList") )
>>>> # Error in .Method(..., FUN = FUN, MoreArgs = MoreArgs, SIMPLIFY =
>>>> SIMPLIFY,  :
>>>> #   all object lengths must be multiple of longest object length
>>>> 
>>>> ### although I can use that coercion if I first give the GRanges object
>>>> the same levels as the RleList to force the lists to have the same names as
>>>> each other:
>>>> seqlevels(myRegions)<- names(myRleList)
>>>> viewMeans(Views( myRleList, as(myRegions,"RangesList") ) )
>>>> # SimpleNumericList of length 3
>>>> # [["chrA"]] numeric(0)
>>>> # [["chrB"]] 1 1
>>>> # [["chrC"]] 0.5
>>>> 
>>>> 
>>> These two lines seem pretty concise to me. It makes sense to layer a
>>> RangesList on top of an RleList. Storing the result would of course be
>>> easier if you had sorted the GRanges by the seqlevels. Having a utility for
>>> that would be nice (orderBySeqlevels?). It would also be easy with
>>> RangedData, which enforces ordering by space.
>>> 
>> 
>> Just to clarify, here is how the function would work:
>> 
>> values(myRegions)$meanCov[orderBySeqlevels(myRegions)]<-
>> unlist(viewMeans(Views( myRleList, as(myRegions,"RangesList") ) ),
>> use.names=FALSE)
>> 
>> So 'orderBySeqlevels' would be a nice bridge back into the flat GRanges
>> world from the Listy world of IRanges.
> 
> Well I'm just remembering now that we still don't have an "order"
> method for GRanges objects like we do for IRanges object. The "natural"
> order for a GRanges object would be to order first by (a) seqlevel,
> (b) then by strand, (c) then by start, (d) and finally by end.
> 
> We already do (c) and (d) for IRanges.
> 
> Also the "reduce" method for GRanges already uses this "natural"
> order implicitly.
> 
> So we will add this "order" method and the other basic order-related
> methods (sort, >=, <=, ==, !=, unique, duplicated, they are all
> missing) for GRanges objects. That will cover Janet's use case and
> other user cases (I think someone asked how to find duplicated in
> a GRanges object a while ago on this list).
> 
> Cheers,
> H.
> 
>> 
>> 
>>> Michael
>>> 
>>> 
>>> 
>>>> ### getting close to a solution - but I'd have liked to have been able to
>>>> unlist this and directly add to my GRanges object e.g.
>>>> values(myRegions)$meanCov<- unlist(viewMeans(Views( myRleList,
>>>> as(myRegions,"RangesList") ) ), use.names=FALSE)
>>>> ### but the regions are in a different order to how I started, so the
>>>> above command would associate the wrong scores with the wrong regions.
>>>> 
>>>> sessionInfo()
>>>> 
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>> 
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> 
>>>> other attached packages:
>>>> [1] GenomicRanges_1.4.6 IRanges_1.10.5
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>> 
>>> 
>> 
>> 	[[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319