[BioC] subset a RleList using GRanges object?

Cook, Malcolm MEC at stowers.org
Fri Dec 13 00:39:50 CET 2013


Herve,

>On 12/04/2013 05:18 PM, Hervé Pagès wrote:
 >> Hi Malcolm,
 >>
 >> On 12/04/2013 11:43 AM, Cook, Malcolm wrote:
 >>> Janet, et al
 >>>
 >>> I believe there are problems with all approaches mentioned so far on
 >>> this thread.
 >>>
 >>> When you index something, length of the result should equal the length
 >>> of the index,
 >>
 >> Not if the index is a logical vector or a Ranges object though.
 >>
 >>> AND, the order of the elements of the result should correspond to the
 >>> order of the indices.
 >>>
 >>> The approaches mentioned all depend upon coersion from GRanges to
 >>> RangesList, but this does not preserve the order of the GRanges.
 >>
 >> True but it's guaranteed to preserve the order within each
 >> space/chromosome. I think this is good enough to make subsetting
 >> a named list-like object (e.g. named RleList, named DNAStringSet,
 >> etc...) by a GRanges object a reasonable thing to support.
 >> The order of the seqlevels in the GRanges object doesn't matter
 >> and won't affect the result of the subsetting because they're
 >> matched to the names of the list-like object to subset.
 >>
 >> Note that subsetting needs to behave like an "endomorphism" i.e.
 >> the result must be of the same type as the original object. So
 >> when subsetting an RleList by a GRanges index (like in Janet's
 >> example), the result will always be an RleList object, even if
 >> the index has more than 1 range per chromosome (in that case
 >> the extracted ranges are concatenated together). I'm still not
 >> convinced this is what Janet really needs.
 >
 >I take this back. Yes subsetting by a GRanges needs to behave as an
 >"endomorphism" but it doesn't have to behave the way I'm describing
 >here (which is the behavior we would get if GRanges subscript 'gr'
 >was replaced with 'as(gr, "RangesList")').
>
 >A more useful behavior (and maybe this is what Malcolm was trying
 >to tell me, sorry) is to return an object that is parallel to the
 >GRanges object:

Yes, exactly, thanks, "parallel" in that it stands in 1-to-1 with the components of the GRanges.

However, the summarization step wants to be computed inside the score of the indexing especially if the indexing is being mcapplied since you want the smallest (summarized) value coming back between processes....

~Malcolm

 >
 >   > gr <- GRanges(Rle(c("chr2", "chr1"), c(3, 2)), IRanges(1, 7:3))
 >   > gr
 >   GRanges with 5 ranges and 0 metadata columns:
 >         seqnames    ranges strand
 >            <Rle> <IRanges>  <Rle>
 >     [1]     chr2    [1, 7]      *
 >     [2]     chr2    [1, 6]      *
 >     [3]     chr2    [1, 5]      *
 >     [4]     chr1    [1, 4]      *
 >     [5]     chr1    [1, 3]      *
 >     ---
 >     seqlengths:
 >      chr1 chr2
 >        NA   NA
 >
 >With an RleList:
 >
 >   > x <- RleList(chr1=1:6, chr2=101:110)
 >   > x[gr]
 >   RleList of length 5
 >   $chr2
 >   integer-Rle of length 7 with 7 runs
 >     Lengths:   1   1   1   1   1   1   1
 >     Values : 101 102 103 104 105 106 107
 >
 >   $chr2
 >   integer-Rle of length 6 with 6 runs
 >     Lengths:   1   1   1   1   1   1
 >     Values : 101 102 103 104 105 106
 >
 >   $chr2
 >   integer-Rle of length 5 with 5 runs
 >     Lengths:   1   1   1   1   1
 >     Values : 101 102 103 104 105
 >
 >   $chr1
 >   integer-Rle of length 4 with 4 runs
 >     Lengths: 1 1 1 1
 >     Values : 1 2 3 4
 >
 >   $chr1
 >   integer-Rle of length 3 with 3 runs
 >     Lengths: 1 1 1
 >     Values : 1 2 3
 >
 >With a DNAStringSet:
 >
 >   > dna <- DNAStringSet(c(chr1="AATANG", chr2="TTACCACGTT"))
 >   > dna[gr]
 >     A DNAStringSet instance of length 5
 >       width seq                                               names
 >
 >   [1]     7 TTACCAC                                           chr2
 >   [2]     6 TTACCA                                            chr2
 >   [3]     5 TTACC                                             chr2
 >   [4]     4 AATA                                              chr1
 >   [5]     3 AAT                                               chr1
 >
 >Then the returned object can be added as a metadata col to the GRanges
 >object, possibly after passing it thru an extra summarization step (with
 >e.g. 'mean(x[gr])' in the case of the RleList), which is probably
 >infinitely more useful than a subsetting that would replace 'gr' with
 >'as(gr, "RangesList")'. And it's still an "endomorphism".
 >
 >I've added this to GenomicRanges 1.15.14 (devel). Only very lightly
 >tested and not optimized yet.
 >
 >Cheers,
 >H.
 >
 >
 >> Like Michael mentioned
 >> earlier, maybe a more useful approach is to create an RleViewsList
 >> object with something like:
 >>
 >>    vlist <- Views(rleList=z, rangesList=as(myRange, "RangesList"))
 >>    viewMeans(vlist)
 >>
 >> More below.
 >>
 >>>
 >>> Look:
 >>>
 >>>> example(GRangesList)
 >>>> as(gr3,'RangesList')
 >>>   IRangesList of length 2
 >>>   $chr1
 >>>   IRanges of length 1
 >>>       start end width
 >>>   [1]     1   3     3
 >>>
 >>>   $chr2
 >>>   IRanges of length 1
 >>>       start end width
 >>>   [1]     4   9     6
 >>>
 >>> !> as(gr3[2:1],'RangesList')
 >>>   IRangesList of length 2
 >>>   $chr1
 >>>   IRanges of length 1
 >>>       start end width
 >>>   [1]     1   3     3
 >>>
 >>>   $chr2
 >>>   IRanges of length 1
 >>>       start end width
 >>>   [1]     4   9     6
 >>>
 >>> I think if you contrive for your GRanges to be in 'canonical order'
 >>> (as defined by the RangesList coersion), you can proceed with this
 >>> approach.  Look:
 >>>
 >>>> stopifnot(all.equal(as(sort(gr3[2:1]),'RangesList'),as(sort(gr3[1:2]),'RangesList')))
 >>>>
 >>>
 >>> Janet, you brought this up in
 >>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/36247/focus=36263
 >>>
 >>>
 >>> with more or less the same outcome.
 >>>
 >>> Here is excerpt from that exchange:
 >>>
 >>>>> An "order" method would be great. Note though that the coercion to
 >>>>> RangesList only sorts by seqlevels, so we would need something that
 >>>>> gave
 >>>>> out that order vector. The point here is to avoid forcing the user to
 >>>>> modify the order of the original GRanges.
 >>>>>
 >>>>
 >>>> With order() the user will still be able to achieve this with something
 >>>> like:
 >>>>
 >>>>   regionMeans <- function(regions, cvg)
 >>>>   {
 >>>>       seqlevels(regions) <- names(cvg)
 >>>>       oo <- order(regions)
 >>>>       regions <- regions[oo]
 >>>>       ans <- unlist(
 >>>>                viewMeans(
 >>>>                  Views(cvg, as(regions, "RangesList"))
 >>>>                ), use.names=FALSE)
 >>>>       ans[oo] <- ans  # restore original order
 >>>>       ans
 >>>>   }
 >>>>
 >>>>   values(myRegions)$meanCov <- regionMeans(myRegions, myRleList)
 >>>>
 >>>> Tricky :-/
 >>
 >> Very related indeed. Thanks for refreshing my memory. Note that this
 >> discussion happened more than 2 years ago at a time when we didn't have
 >> order() for GRanges yet. Now we have it so my regionMeans() proposal
 >> above should work. FWIW in ?tileGenome (GenomicRanges package), I show
 >> how to implement a similar function (binnedAverage) without the need
 >> for ordering the GRanges object (it uses a split/unsplit aproach
 >> instead). The design of binnedAverage() is also a little bit cleaner.
 >>
 >>>
 >>> Let's not forget this lesson!
 >>>
 >>> However, I agree, some syntactic sugar that allows indexing an RleList
 >>> by GRanges with the proposed semantics would be great.
 >>>
 >>> In the mean time, I offer this function, which I found to be faster
 >>> than Herve's offering above, and also provides a few more useful (to
 >>> me) abstractions.
 >>>
 >>> YMMV:
 >>>
 >>>
 >>> applyAt <- function(cvg # an RleList (quite possibly representing a
 >>> genomic coverage)
 >>>                      ,gr # a GRanges at each element of which you want
 >>> to evaluate a function
 >>>                      ,FUN=identity # by default, just return
 >>> individual rleLists for each element of gr
 >>>                      ,...          # additional options to FUN
 >>>                      ,.MAPPLY=mapply #optionally parallize with
 >>> mcmapply or your favorite variant
 >>>                      ,USE.NAMES=TRUE
 >>>                      ,SIMPLIFY=TRUE #'array'#TRUE
 >>>                      ,MoreArgs=NULL # passed to .MAPPLY
 >>>                      ,ignore.strand=TRUE # else, reverse the coverages
 >>>                      ) {
 >>>    revif<-function(s,x) {
 >>>      if(ignore.strand==TRUE) {x}
 >>>      else if (s=='+')  {x}
 >>>      else {rev(x)}
 >>>    }
 >>>    ans<-.MAPPLY(function(a,b,c,s) {
 >>>        FUN(revif(s,cvg[[a]][IRanges(b,c),drop=FALSE]),...)
 >>>    }
 >>>
 >>> ,as.vector(seqnames(gr)),start(gr),end(gr),as.vector(strand(gr))
 >>>                 ,MoreArgs=MoreArgs
 >>>                 ,SIMPLIFY=SIMPLIFY
 >>>                 ,USE.NAMES=FALSE         #USE.NAMES
 >>>                 )
 >>>    if (!USE.NAMES) {}
 >>>    else if (is.null(dim(ans))) {names(ans)<-names(gr)}
 >>>    else if (SIMPLIFY=='array') {colnames(ans)<-names(gr)}
 >>>    else  rownames(ans)<-names(gr)
 >>>    ans
 >>> }
 >>>
 >>> Finally, what I think we REALLY want is a version of the above which
 >>> currys a FUN (possibly `identity`) down to rle levels that can be
 >>> applied to a bigwig file without having to load the whole thing into
 >>> memory.
 >>>
 >>> In the mean time, I offer:
 >>>
 >>> applyAt.bw<-function(x,y,...) {
 >>>      applyAt(import.bw(x,asRle=TRUE
 >>>                        ,selection=BigWigSelection(y)
 >>>                        )
 >>>              ,y
 >>>              ,...)
 >>> }
 >>
 >> Interesting. It sounds to me like maybe one way to facilitate this
 >> (and also avoid the proliferation of apply-like functions) would be
 >> to finally bite the bullet and come up with a new kind of view objects
 >> where the views are *genomic* ranges instead of just ranges (like they
 >> are right now). Something that was already mentioned in this
 >> more-than-2-year-old discussion. We could start with RleListGViews
 >> and BigWigFileGViews objects, both concrete subclasses of virtual
 >> class GViews.
 >>
 >> Thanks for you feedback,
 >> H.
 >>
 >>>
 >>> ~Malcolm
 >>>
 >>>
 >>>   >-----Original Message-----
 >>>   >From: bioconductor-bounces at r-project.org
 >>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Janet Young
 >>>   >Sent: Wednesday, December 04, 2013 12:59 PM
 >>>   >To: Hervé Pagès
 >>>   >Cc: Michael Lawrence; bioconductor at r-project.org
 >>>   >Subject: Re: [BioC] subset a RleList using GRanges object?
 >>>   >
 >>>   >Thanks, all.  Yes, that coercion will do it for me:   z[as(myRange,
 >>> "RangesList")].   It's always nice for the end user when you take care of
 >>>   >those kinds of things behind the scenes for us - sounds like what
 >>> you're considering implementing might take care of it.
 >>>   >
 >>>   >Robert: thanks for the ideas!  But my needs are a little different,
 >>> though - I'm keeping the individual values in the region I'm interested
 >>>   >in (I'm plotting RNA-seq coverage on just a single gene, but my
 >>> coverage object is for the whole genome).
 >>>   >
 >>>   >Janet
 >>>   >
 >>>   >
 >>>   >
 >>>   >On Dec 4, 2013, at 10:09 AM, Hervé Pagès wrote:
 >>>   >
 >>>   >> Hi Michael,
 >>>   >>
 >>>   >> On 12/03/2013 07:13 PM, Michael Lawrence wrote:
 >>>   >>> For now, just coerce to a RangesList.
 >>>   >>>
 >>>   >>> z [ as(myRange, "RangesList") ]
 >>>   >>>
 >>>   >>> Herve might consider adding a [,List,GenomicRanges method...
 >>> kind of a
 >>>   >>> logical leap though from seqnames to list element names.
 >>>   >>
 >>>   >> This is something I've been considering for a while. I think we
 >>> should
 >>>   >> do it. The mapping between the seqnames of a GRanges and the
 >>> names of a
 >>>   >> RangesList is well established at this point e.g. the coercion
 >>> (back and
 >>>   >> forth) between the two already does that.
 >>>   >>
 >>>   >> Cheers,
 >>>   >> H.
 >>>   >>
 >>>   >>>
 >>>   >>> extractCoverageForPositions is different; it extracts an integer
 >>> vector
 >>>   >>> given a vector of width-one positions.
 >>>   >>>
 >>>   >>>
 >>>   >>>
 >>>   >>>
 >>>   >>>
 >>>   >>>
 >>>   >>> On Tue, Dec 3, 2013 at 4:53 PM, Janet Young <jayoung at fhcrc.org>
 >>> wrote:
 >>>   >>>
 >>>   >>>> Hi there,
 >>>   >>>>
 >>>   >>>> I'm playing around with coverage data generated outside of R,
 >>> planning to
 >>>   >>>> plot RNA-seq coverage for some genes we're interested in.
 >>>   >>>>
 >>>   >>>> I have a request - it'd be nice from my point of view if it
 >>> were possible
 >>>   >>>> to look at a small region an RleList (or CompressedRleList)
 >>> using a GRanges
 >>>   >>>> object to focus on that smaller region.  Looks like I can
 >>> subset a single
 >>>   >>>> Rle object with an IRanges object, but I wonder if this nice
 >>> feature could
 >>>   >>>> be extended to GRanges objects?
 >>>   >>>>
 >>>   >>>> Some code is below to show what I'm trying to do.
 >>>   >>>>
 >>>   >>>> thanks veruy much,
 >>>   >>>>
 >>>   >>>> Janet
 >>>   >>>>
 >>>   >>>>
 >>>   >>>>
 >>>   >>>> library(GenomicRanges)
 >>>   >>>>
 >>>   >>>> ## an example RleList
 >>>   >>>> x <- Rle(10:1, 1:10)
 >>>   >>>> y <- Rle(10:1, 1:10)
 >>>   >>>> z <- RleList( chr1=x, chr2=y)
 >>>   >>>>
 >>>   >>>> ## an example GRanges object
 >>>   >>>> myRange <- GRanges( seqnames="chr1",
 >>> ranges=IRanges(start=10,end=15) )
 >>>   >>>>
 >>>   >>>> ## subsetting an Rle using an IRanges object works, as expected:
 >>>   >>>> z[["chr1"]] [ ranges(myRange) ]
 >>>   >>>>
 >>>   >>>> ## but subsetting an RleList by GRanges object doesn't work
 >>>   >>>> z [ myRange ]
 >>>   >>>> # Error in normalizeSingleBracketSubscript(i, x) : invalid
 >>> subscript type
 >>>   >>>>
 >>>   >>>> sessionInfo()
 >>>   >>>>
 >>>   >>>> R Under development (unstable) (2013-11-06 r64163)
 >>>   >>>> Platform: x86_64-unknown-linux-gnu (64-bit)
 >>>   >>>>
 >>>   >>>> locale:
 >>>   >>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 >>>   >>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 >>>   >>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 >>>   >>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 >>>   >>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 >>>   >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
 >>>   >>>>
 >>>   >>>> attached base packages:
 >>>   >>>> [1] parallel  stats     graphics  grDevices utils     datasets
 >>> methods
 >>>   >>>> [8] base
 >>>   >>>>
 >>>   >>>> other attached packages:
 >>>   >>>> [1] GenomicRanges_1.15.10 XVector_0.3.2         IRanges_1.21.13
 >>>   >>>> [4] BiocGenerics_0.9.1
 >>>   >>>>
 >>>   >>>> loaded via a namespace (and not attached):
 >>>   >>>> [1] stats4_3.1.0
 >>>   >>>>
 >>>   >>>>
 >>>   >>>>
 >>>   >>>>
 >>>   >>>>
 >>> -------------------------------------------------------------------
 >>>   >>>>
 >>>   >>>> Dr. Janet Young
 >>>   >>>>
 >>>   >>>> Malik lab
 >>>   >>>> http://research.fhcrc.org/malik/en.html
 >>>   >>>>
 >>>   >>>> Fred Hutchinson Cancer Research Center
 >>>   >>>> 1100 Fairview Avenue N., A2-025,
 >>>   >>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
 >>>   >>>>
 >>>   >>>> tel: (206) 667 4512
 >>>   >>>> email: jayoung  ...at...  fhcrc.org
 >>>   >>>>
 >>>   >>>>
 >>> -------------------------------------------------------------------
 >>>   >>>>
 >>>   >>>> _______________________________________________
 >>>   >>>> Bioconductor mailing list
 >>>   >>>> Bioconductor at r-project.org
 >>>   >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
 >>>   >>>> Search the archives:
 >>>   >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>>   >>>>
 >>>   >>>
 >>>   >>>     [[alternative HTML version deleted]]
 >>>   >>>
 >>>   >>> _______________________________________________
 >>>   >>> Bioconductor mailing list
 >>>   >>> Bioconductor at r-project.org
 >>>   >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
 >>>   >>> Search the archives:
 >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>>   >>>
 >>>   >>
 >>>   >> --
 >>>   >> Hervé Pagès
 >>>   >>
 >>>   >> Program in Computational Biology
 >>>   >> Division of Public Health Sciences
 >>>   >> Fred Hutchinson Cancer Research Center
 >>>   >> 1100 Fairview Ave. N, M1-B514
 >>>   >> P.O. Box 19024
 >>>   >> Seattle, WA 98109-1024
 >>>   >>
 >>>   >> E-mail: hpages at fhcrc.org
 >>>   >> Phone:  (206) 667-5791
 >>>   >> Fax:    (206) 667-1319
 >>>   >
 >>>   >_______________________________________________
 >>>   >Bioconductor mailing list
 >>>   >Bioconductor at r-project.org
 >>>   >https://stat.ethz.ch/mailman/listinfo/bioconductor
 >>>   >Search the archives:
 >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>>
 >>
 >
 >--
 >Hervé Pagès
 >
 >Program in Computational Biology
 >Division of Public Health Sciences
 >Fred Hutchinson Cancer Research Center
 >1100 Fairview Ave. N, M1-B514
 >P.O. Box 19024
 >Seattle, WA 98109-1024
 >
 >E-mail: hpages at fhcrc.org
 >Phone:  (206) 667-5791
 >Fax:    (206) 667-1319



More information about the Bioconductor mailing list