[BioC] views on Rle using GRanges object
    Hervé Pagès 
    hpages at fhcrc.org
       
    Mon Mar 31 19:09:29 CEST 2014
    
    
  
Hi Michael,
On 03/31/2014 05:22 AM, Michael Lawrence wrote:
> Raising this one from the dead: this would be a very nice thing to have.
> It's tough explaining to users how to bridge GRanges and RangesList for
> aggregating vectors (RleLists) by GRanges. We could start with
> rangeSums, rangeMeans, rangeMaxs, etc, that directly summarize and
> RleList, and then maybe move towards a GenomicViews object. Maybe
> something for the next release cycle?
Yeah I'd like to start working on something like this. Note that, kind
of related to this, I added subsetting a named List by a GRanges
subscript recently. But what we really need is a simple container that
bundles a GRanges object with a named List subject, I think.
H.
>
> Michael
>
>
> On Wed, Aug 17, 2011 at 5:55 AM, Michael Lawrence <michafla at gene.com
> <mailto:michafla at gene.com>> wrote:
>
>
>
>     2011/8/16 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>         Michael,
>
>         On 11-08-16 10:52 AM, Michael Lawrence wrote:
>
>
>
>             2011/8/16 Hervé Pagès <hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org>>>
>
>
>                 Hi Michael,
>
>
>                 On 11-08-15 10:28 PM, Michael Lawrence wrote:
>
>                     On Mon, Aug 15, 2011 at 10:10 PM, Michael
>                     Lawrence<michafla at gene.com
>             <mailto:michafla at gene.com> <mailto:michafla at gene.com
>             <mailto:michafla at gene.com>>>____wrote:
>
>
>
>
>                         On Mon, Aug 15, 2011 at 5:25 PM, Janet
>                         Young<jayoung at fhcrc.org
>             <mailto:jayoung at fhcrc.org> <mailto:jayoung at fhcrc.org
>             <mailto:jayoung at fhcrc.org>>>  wrote:
>
>
>                             Hi again,
>
>                             I have another question, in the "I think I
>             need a
>                             convoluted workaround,
>                             but maybe I'm missing a simple solution"
>             genre (seems
>                             like most of my
>                             questions are like that).
>
>                             I have an RleList object representing
>             mapability for the
>                             whole human
>                             genome. I'd like to:
>                             (a) calculate viewMeans for various regions
>             of interest
>                             that I've been
>                             representing as GRanges, and
>                             (b) I'd like the underlying code to be smart
>             and match
>                             chromosome names up
>                             in the RleList and the GRanges object (not
>             rely on
>                             chromosomes being ordered
>                             the same in the two objects), and
>                             (c) I'd like to return the viewMeans results
>             in the same
>                             order as the
>                             objects in my original GRanges
>
>                             I don't think this is currently implemented
>             without
>                             doing several
>                             coercions (that introduce their own
>             complications) but
>                             I'm not sure.  Some
>                             code is below that shows what I'm trying to
>             do.   It
>                             seems like it might be
>                             a common kind of way to use viewMeans - I
>             imagine people
>                             are gradually
>                             switching to use GRanges more and more? but
>             really I
>                             don't know.
>
>                             My real world question is that I've read in
>             mapability
>                             from a UCSC bigWig
>                             file, and made that into an RleList.  I have
>             a bunch of
>                             other regions I've
>                             read in (again from UCSC) using rtracklayer
>             as GRanges,
>                             and have annotated
>                             those with various things I'm interested in
>             (e.g. number
>                             of RNAseq reads in
>                             the region).  I want to add the average
>             mapability for
>                             each region, so that
>                             I can later look at how mapability affects
>             those other
>                             things I'm
>                             annotating.
>
>                             Should I be being more strict with myself
>             about how my
>                             GRanges are ordered
>                             and making sure they all have the same
>             seqlevels and
>                             seqlengths?  Perhaps
>                             that would help the coercion workarounds go
>             more smoothly.
>
>                             thanks,
>
>                             Janet
>
>
>                             library(GenomicRanges)
>
>                             ### make an RleList. Just for this example,
>             I starting
>                             with a GRanges
>                             object and used coverage to get an RleList
>             example. To
>                             get my real data I
>                             read in a UCSC bigWig file.
>
>                             fakeRegions<-
>             GRanges(seqnames=c("chrA","____chrA",
>                             "chrB", "chrC"),
>                             ranges=IRanges(start=c(1,51,1,____1),
>             end=c(60,90,20,10) ) )
>                             seqlengths(fakeRegions)<- c(100,100,100)
>                             myRleList<- coverage(fakeRegions)
>                             rm(fakeRegions)
>
>                             myRleList
>
>                             # SimpleRleList of length 3
>                             # $chrA
>                             # 'integer' Rle of length 100 with 4 runs
>                             #   Lengths: 50 10 30 10
>                             #   Values :  1  2  1  0
>                             #
>                             # $chrB
>                             # 'integer' Rle of length 100 with 2 runs
>                             #   Lengths: 20 80
>                             #   Values :  1  0
>                             #
>                             # $chrC
>                             # 'integer' Rle of length 100 with 2 runs
>                             #   Lengths: 10 90
>                             #   Values :  1  0
>
>                             ### make some regions of interest
>                             myRegions<- GRanges(seqnames=c("chrB",
>             "chrC", "chrB"),
>                             ranges=IRanges(start=c(1,1,5),
>             end=c(20,20,10) ),
>                             geneNames=c("g1","g2","g3") )
>                             myRegions
>                             # GRanges with 3 ranges and 1
>             elementMetadata value
>                             #     seqnames    ranges strand |   geneNames
>                             #<Rle> <IRanges> <Rle>  |<character>
>                             # [1]     chrB   [1, 20]      * |          g1
>                             # [2]     chrC   [1, 20]      * |          g2
>                             # [3]     chrB   [5, 10]      * |          g3
>                             #
>                             # seqlengths
>                             #  chrB chrC
>                             #    NA   NA
>
>                             ## can't use GRanges directly
>                             Views( myRleList, myRegions)
>                             # Error in RleViewsList(rleList = subject,
>             rangesList =
>                             start) :
>                             # 'rangesList' must be a RangesList object
>
>                             ## can't use a simple coercion
>                             Views( myRleList, as(myRegions,"RangesList") )
>                             # Error in .Method(..., FUN = FUN, MoreArgs
>             = MoreArgs,
>                             SIMPLIFY =
>                             SIMPLIFY,  :
>                             #   all object lengths must be multiple of
>             longest
>                             object length
>
>                             ### although I can use that coercion if I
>             first give the
>                             GRanges object
>                             the same levels as the RleList to force the
>             lists to
>                             have the same names as
>                             each other:
>                             seqlevels(myRegions)<- names(myRleList)
>                             viewMeans(Views( myRleList,
>             as(myRegions,"RangesList") ) )
>                             # SimpleNumericList of length 3
>                             # [["chrA"]] numeric(0)
>                             # [["chrB"]] 1 1
>                             # [["chrC"]] 0.5
>
>
>                         These two lines seem pretty concise to me. It
>             makes sense to
>                         layer a
>                         RangesList on top of an RleList. Storing the
>             result would of
>                         course be
>                         easier if you had sorted the GRanges by the
>             seqlevels.
>                         Having a utility for
>                         that would be nice (orderBySeqlevels?). It would
>             also be
>                         easy with
>                         RangedData, which enforces ordering by space.
>
>
>                     Just to clarify, here is how the function would work:
>
>
>               values(myRegions)$meanCov[____orderBySeqlevels(myRegions)]<-
>                     unlist(viewMeans(Views( myRleList,
>             as(myRegions,"RangesList") ) ),
>                     use.names=FALSE)
>
>                     So 'orderBySeqlevels' would be a nice bridge back
>             into the flat
>                     GRanges
>                     world from the Listy world of IRanges.
>
>
>                 Well I'm just remembering now that we still don't have
>             an "order"
>                 method for GRanges objects like we do for IRanges
>             object. The "natural"
>                 order for a GRanges object would be to order first by
>             (a) seqlevel,
>                 (b) then by strand, (c) then by start, (d) and finally
>             by end.
>
>                 We already do (c) and (d) for IRanges.
>
>                 Also the "reduce" method for GRanges already uses this
>             "natural"
>                 order implicitly.
>
>                 So we will add this "order" method and the other basic
>             order-related
>                 methods (sort, >=, <=, ==, !=, unique, duplicated, they
>             are all
>                 missing) for GRanges objects. That will cover Janet's
>             use case and
>                 other user cases (I think someone asked how to find
>             duplicated in
>                 a GRanges object a while ago on this list).
>
>
>             An "order" method would be great. Note though that the
>             coercion to
>             RangesList only sorts by seqlevels, so we would need
>             something that gave
>             out that order vector. The point here is to avoid forcing
>             the user to
>             modify the order of the original GRanges.
>
>
>         With order() the user will still be able to achieve this with
>         something
>         like:
>
>           regionMeans <- function(regions, cvg)
>           {
>               seqlevels(regions) <- names(cvg)
>               oo <- order(regions)
>               regions <- regions[oo]
>               ans <- unlist(
>                        viewMeans(
>                          Views(cvg, as(regions, "RangesList"))
>                        ), use.names=FALSE)
>               ans[oo] <- ans  # restore original order
>               ans
>           }
>
>           values(myRegions)$meanCov <- regionMeans(myRegions, myRleList)
>
>         Tricky :-/
>
>         But maybe we could provide a "mean" method that accepts 2 args:
>         a GRanges and an RleList/IntegerList/__NumericList. Same with
>         min, max, sum etc... they would all return a numeric vector of the
>         same length as their first argument (the GRanges object).
>
>
>     Patrick and I had talked about this. We were calling it the
>     rangeMeans function. It would go along with rangeMins, rangeMaxs,
>     etc. Basically a way to perform a windowed computation without
>     having to construct an intermediate Views. This is a lot easier to
>     design and implement, as you were saying below.
>
>     There are a lot of other (non-Rle) Vectors that would benefit from
>     window-based summaries. These include the basic vector types, as
>     well as ranges themselves. For example, I and others have
>     encountered a need to find the nearest neighbors of a set of ranges,
>     with the constraint that they fall within the same gene or exon or
>     whatever. One could come up with a GRanges with the seqnames
>     corresponding to a gene, but that's a hack and the current looping
>     implementation of nearest,GRanges is a bit slow.
>
>     Michael
>
>         Another approach could be that we make the Views() constructor more
>         flexible so Views(myRleList, myRegions) would just work and return
>         a Views object (not a ViewsList) but that means redefining what a
>         Views object can be (@subject can now be a named List and @ranges
>         a GRanges object). Maybe a cleaner design would be to use a new
>         container for this e.g. GViews (for Genomic Views)...
>
>         H.
>
>
>
>             Michael
>
>                 Cheers,
>                 H.
>
>
>
>                         Michael
>
>
>
>                             ### getting close to a solution - but I'd
>             have liked to
>                             have been able to
>                             unlist this and directly add to my GRanges
>             object e.g.
>                             values(myRegions)$meanCov<-
>             unlist(viewMeans(Views(
>                             myRleList,
>                             as(myRegions,"RangesList") ) ), use.names=FALSE)
>                             ### but the regions are in a different order
>             to how I
>                             started, so the
>                             above command would associate the wrong
>             scores with the
>                             wrong regions.
>
>                             sessionInfo()
>
>                             R version 2.13.1 (2011-07-08)
>                             Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
>                             locale:
>                             [1]
>             en_US.UTF-8/en_US.UTF-8/C/C/____en_US.UTF-8/en_US.UTF-8
>
>                             attached base packages:
>                             [1] stats     graphics  grDevices utils
>             datasets
>                               methods   base
>
>                             other attached packages:
>                             [1] GenomicRanges_1.4.6 IRanges_1.10.5
>
>
>               ___________________________________________________
>                             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>                             <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>
>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>               <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                             Search the archives:
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>               <http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>                             [[alternative HTML version deleted]]
>
>
>                     ___________________________________________________
>                     Bioconductor mailing list
>             Bioconductor at r-project.org
>             <mailto:Bioconductor at r-project.org>
>             <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>
>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>               <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                     Search the archives:
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>               <http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>                 --
>                 Hervé Pagès
>
>                 Program in Computational Biology
>                 Division of Public Health Sciences
>                 Fred Hutchinson Cancer Research Center
>                 1100 Fairview Ave. N, M1-B514
>                 P.O. Box 19024
>                 Seattle, WA 98109-1024
>
>                 E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>                 Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>             <tel:%28206%29%20667-5791>
>                 Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>             <tel:%28206%29%20667-1319>
>
>
>
>
>         --
>         Hervé Pagès
>
>         Program in Computational Biology
>         Division of Public Health Sciences
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M1-B514
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>
>         E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
-- 
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
    
    
More information about the Bioconductor
mailing list