[BioC] IRanges: Trying to cut overlapping intervals into pieces

Hervé Pagès hpages at fhcrc.org
Thu Jan 29 04:47:01 CET 2009


Hi Elizabeth,

Elizabeth Purdom wrote:
> Thanks for the help. It's much neater than what I had (and I do have 
> gaps). If you are thinking about adding this kind of functionality, I'd 
> be glad to tell you more specifically what I am doing if it helps to 
> have a more general picture.
> 
> I kept thinking the intersect or setdiff would help for different 
> things, but it wound up not doing what I needed. For example, it doesn't 
> do pairwise, but across the entire set, right? I often had a set of 
> intervals that I wanted to 'subtract' pairwise from another set of 
> intervals. Just to know, is there a function that does that?

The union, setdiff or substraction of 2 ranges can not always be represented
by a single range so we need to be careful when performing those operations
pairwise.

For the pairwise intersection, no problem, the intersection of 2 ranges
can always be represented by a single range (eventually empty):

   pairwiseIntersect <- function(ir1, ir2)
   {
     if (length(ir1) != length(ir2))
         stop("'ir1' and 'ir2' must have the same length")
     ans_start <- pmax.int(start(ir1), start(ir2))
     ans_end <- pmin.int(end(ir1), end(ir2))
     ans_width <- ans_end - ans_start + 1L
     ans_width[ans_width < 0L] <- 0L
     IRanges(start=ans_start, width=ans_width)
   }

   > ir1 <- IRanges(c(1, 5, -2, 0, 14), c(10, 9, 3, 11, 17))
   > ir2 <- IRanges(c(14, 0, -5, 6, 18), c(20, 2, 2, 8, 20))
   > ir1
   IRanges object:
     start end width
   1     1  10    10
   2     5   9     5
   3    -2   3     6
   4     0  11    12
   5    14  17     4
   > ir2
   IRanges object:
     start end width
   1    14  20     7
   2     0   2     3
   3    -5   2     8
   4     6   8     3
   5    18  20     3
   > pairwiseIntersect(ir1, ir2)
   IRanges object:
     start end width
   1    14  13     0  <-- note the empty ranges when the intersection is empty
   2     5   4     0
   3    -2   2     5
   4     6   8     3
   5    18  17     0

The union of 2 ranges will produce 1 range if there is no gap between
the 2 ranges (i.e. they are overlapping or adjacent). So, in the general
case, it's not possible to return the result of the pairwise union in an
IRanges object of the same length as the input.
One solution is to add the 'fill.gap' argument to let the user control
whether s/he wants to enforce the union even when there is a gap
between the 2 ranges (in that case, the union is obtained by filling
this gap):

   pairwiseUnion <- function(ir1, ir2, fill.gap=FALSE)
   {
     if (length(ir1) != length(ir2))
         stop("'ir1' and 'ir2' must have the same length")
     if (!fill.gap) {
         gap <- pmax.int(start(ir1), start(ir2)) -
                pmin.int(end(ir1), end(ir2)) - 1L
         if (any(gap > 0L))
             stop("some pair of ranges have a gap within ",
                  "the 2 members of the pair.\n",
                  "  Use 'fill.gap=TRUE' to enforce union ",
                  "by filling the gap.")
     }
     ans_start <- pmin.int(start(ir1), start(ir2))
     ans_end <- pmax.int(end(ir1), end(ir2))
     IRanges(start=ans_start, end=ans_end)
   }

   > pairwiseUnion(ir1, ir2)
   Error in pairwiseUnion(ir1, ir2) :
     some pair of ranges have a gap within the 2 members of the pair.
     Use 'fill.gap=TRUE' to enforce union by filling the gap.
   > pairwiseUnion(ir1[3:5], ir2[3:5])
   IRanges object:
     start end width
   1    -5   3     9
   2     0  11    12
   3    14  20     7
   > pairwiseUnion(ir1, ir2, fill.gap=TRUE)
   IRanges object:
     start end width
   1     1  20    20  <- gap was filled
   2     0   9    10  <- gap was filled
   3    -5   3     9
   4     0  11    12
   5    14  20     7

Subtracting range two from range one will produce 1 range except when
range two has its end points strictly inside range one. What can we do
in such situation? The following function will return an error when this
happens:

   pairwiseSubtract <- function(ir1, ir2)
   {
     if (length(ir1) != length(ir2))
         stop("'ir1' and 'ir2' must have the same length")
     ans_start <- start(ir1)
     ans_end <- end(ir1)
     if (any((start(ir2) > ans_start) & (end(ir2) < ans_end)))
         stop("some ranges in 'ir2' have their end points strictly inside\n",
              "  the range in 'ir1' that they need to be subtracted from")
     start2 <- pmax.int(ans_start, start(ir2))
     end2 <- pmin.int(ans_end, end(ir2))
     ii <- start2 <= end2
     jj <- end2 == ans_end
     kk <- ii & jj
     ans_end[kk] <- start2[kk] - 1L
     kk <- ii & (!jj)
     ans_start[kk] <- end2[kk] + 1L
     IRanges(start=ans_start, end=ans_end)
   }

   > pairwiseSubtract(ir1, ir2)
   Error in pairwiseSubtract(ir1, ir2) :
     some ranges in 'ir2' have their end points strictly inside
     the range in 'ir1' that they need to be subtracted from
   > start(ir2)[4] <- -99
   > end(ir2)[4] <- 99
   > pairwiseSubtract(ir1, ir2)
   IRanges object:
     start end width
   1     1  10    10
   2     5   9     5
   3     3   3     1
   4     0  -1     0  <- nothing left in this range
   5    14  17     4

I will add these 3 functions to the IRanges package and will let you know
when this is done.

Cheers,

H.

> I thought 
> maybe 'narrow', but sometimes subtracting an interval would cut an 
> existing interval into pieces, and narrow doesn't seem to do this.
> 
> Best,
> Elizabeth
> 
> Michael Lawrence wrote:
>>
>>
>> On Fri, Jan 23, 2009 at 10:24 PM, Elizabeth Purdom 
>> <epurdom at stat.berkeley.edu <mailto:epurdom at stat.berkeley.edu>> wrote:
>>
>>     Hi,
>>
>>     I am trying to take overlapping intervals and return a set of
>>     intervals that are not overlapping but cover all of the region (and
>>     mantain the intervals that don't overlap). In particular, I don't
>>     want to merge intervals that overlap together (i.e. the reduce
>>     function in IRanges)-- I want to cut them up into distinct regions.
>>     For example, if I have intervals:
>>     [1,6], [4,8], [7,10]
>>     I want to get back the set of adjacent intervals:
>>     [1,3],[4,6],[7,8],[9,10]
>>
>>
>> Well that's a fun one.
>>
>> ir <- IRanges(c(1, 4, 7), c(6, 8, 10))
>> adj <- IRanges(sort(unique(c(start(ir), head(end(ir),-1)+1))), 
>> sort(unique(c(end(ir), tail(start(ir),-1)-1))))
>>
>> ... is a not so nice one, but pretty fast..
>>  
>> But if you had a gap in those ranges, like:
>>
>> ir <- IRanges(c(1, 4, 10), c(6, 8, 10))
>>
>> So there's a gap at position 9, you would need an additional filtering 
>> step:
>>
>> adj[adj %in% ir]
>>
>> This last step requires the devel version of IRanges, but can be 
>> emulated using !is.na <http://is.na>(overlap(ir, adj, multiple=FALSE)).
>>
>>
>>     The options I find that look like they perhaps do this (intersect or
>>     setdiff?) seem to be related to the 'normal' ranges class; but this
>>     class requires a gap between intervals -- no adjacent intervals --
>>     which is not what I want. Is there a nice way to do this with
>>     IRanges (or a not so nice one, but fast)?
>>
>>
>> The intersect and setdiff functions are for any Ranges, normal or not. 
>> They return normal IRanges though. Perhaps the documentation does not 
>> make this clear. They probably aren't very useful functions.
>>  
>>
>>
>>     Similarly, is there a 'reduce' version that doesn't merge adjacent
>>     intervals but only truly overlapping ones? There are a lot of
>>     annotation examples where you wouldn't not want to merge adjacent
>>     intervals (e.g. UTRs)
>>
>>
>> Try a trick like this:
>>
>> ir2 <- IRanges(c(1, 5, 7), c(4, 6, 9))
>> width(ir2) <- width(ir2) - 1
>> rir2 <- reduce(ir2)
>> width(rir2) <- width(rir2) + 1
>>
>> Or find the overlap, reduce those that did overlap and combine that 
>> result with those that did not overlap.
>>  
>>
>>
>>     Thanks for any assistance!
>>
>>
>> Thanks for providing more use cases. We'll consider adding 
>> functionality along these lines to the base package (actually the 
>> reduce one has been on the TODO list for many months).
>>  
>>
>>
>>     Elizabeth Purdom
>>     Division of Biostatistics
>>     UC, Berkeley
>>
>>     _______________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at stat.math.ethz.ch 
>> <mailto:Bioconductor at stat.math.ethz.ch>
>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list