[BioC] finding neigbouring intervals for 2 GRange objects

Tue Mar 5 00:50:13 CET 2013

Hi,

FWIW I also find it confusing that (d) gives something different:

   subject <- GRanges("chr1", IRanges(1:3, width=1))
   x <- subject[2]

(a): + / +

   strand(x) <- strand(subject) <- "+"

   > precede(x, subject)
   [1] 3
   > follow(x, subject)
   [1] 1

(b): + / *

   strand(x) <- "+"
   strand(subject) <- "*"

   > precede(x, subject)
   [1] 3
   > follow(x, subject)
   [1] 1

(c): * / +

   strand(x) <- "*"
   strand(subject) <- "+"

   > precede(x, subject)
   [1] 3
   > follow(x, subject)
   [1] 1

(d): * / *

   strand(x) <- strand(subject) <- "*"

   > precede(x, subject)
   [1] 1
   > follow(x, subject)
   [1] 1

What could be the motivation behind this? Wouldn't things be more
consistent and less confusing if (a), (b), (c), and (d) all behaved
consistently with (e):

   > precede(ranges(x), ranges(subject))
   [1] 3
   > follow(ranges(x), ranges(subject))
   [1] 1

Thanks,
H.

On 03/04/2013 09:57 AM, Michael Lawrence wrote:
> On Mon, Mar 4, 2013 at 9:17 AM, Valerie Obenchain <vobencha at fhcrc.org>wrote:
>
>> Hi Hermann, Harris,
>>
>> I wanted to give some background on the behavior of these functions for
>> GRanges vs IRanges.
>>
>>> gr <- GRanges("chr1", IRanges(c(1,2,3), width=1))
>>> gr
>>
>> GRanges with 3 ranges and 0 metadata columns:
>>        seqnames    ranges strand
>>           <Rle> <IRanges>  <Rle>
>>    [1]     chr1    [1, 1]      *
>>    [2]     chr1    [2, 2]      *
>>    [3]     chr1    [3, 3]      *
>>
>> -----------------------------
>> "+" strand behavior:
>> -----------------------------
>> IRanges objects have no strand so you can expect them to behave as "+"
>> strand GRanges do. If this is ever not the case please let me know. That
>> would be a bug.
>>
>>> strand(gr) <- "+"
>>> precede(gr)
>> [1]  2  3 NA
>>> precede(ranges(gr))
>> [1]  2  3 NA
>>
>> -----------------------------
>> "-" strand behavior:
>> -----------------------------
>> I think it makes intuitive sense that precede() and follow() on "-" strand
>> GRanges will give the opposite results from precede() and follow() on "+"
>> strand GRanges.
>>
>>> strand(gr) <- "-"
>>> precede(gr)
>> [1] NA  1  2
>>
>> -----------------------------
>> "*" strand behavior:
>> -----------------------------
>> For "*" strand GRanges maybe we tried to be too clever.  When
>> ignore.strand=TRUE strand is treated as "+":
>>> strand(gr) <- "*"
>>> precede(gr, ignore.strand=TRUE)
>> [1]  2  3 NA
>>
>> When ignore.strand=FALSE for a "*" strand range, results are computed for
>> both "+' and "-". If one result is NA, the other is taken. This is how we
>> end up with a '2' in positions 1 and 3 (position 3 was NA for "+" and
>> position 1 was NA for "-"). If there is a tie, the first in order is taken.
>> This is how we get the '1' in position 2 (tie was between '1' and '3' and
>> '1' comes before '3' in order).
>>> precede(gr, ignore.strand=FALSE)
>> [1] 2 1 2
>>
>> It was a challenge to decide how "*" strand with ignore.strand=FALSE
>> should behave. If others have opinions on this I'd be interested in hearing
>> them.
>>
>>
>
> So it seems that "*" in GenomicRanges can mean at least two things:
> "either" or "missing". I think at one point these were represented by
> different values, but now the meanings have been merged for simplicity.
> Instead, we need to use an extra parameter, in this case ignore.strand, to
> distinguish the two meanings during an operation. It's often convenient to
> be able to change the meaning on a per-operation basis, but there is also a
> desire to keep the meaning in the data structure. Anyway, all this has been
> decided long ago and it probably does not make sense to change anything now.
>
> There is some concern though about the usability of the API. Usually, a "*"
> is used to select the natural/IRanges behavior, but the default of
> ignore.strand is FALSE, so the "either" meaning is assumed. This is going
> to surprise users. The same is true of ignore.strand in findOverlaps: many
> users are surprised to find that by default 'strand' separates the ranges
> into two coordinate systems, instead of just indicating direction.
>
> Michael
>
>
>> Valerie
>>
>>
>>
>>
>>
>>
>> On 03/03/13 16:38, Harris A. Jaffee wrote:
>>
>>> You want precede() and follow(), and then distance(), but tread somewhat
>>> carefully with strands of "*":
>>>
>>>   x=IRanges(2, 2)
>>>> x
>>>>
>>> IRanges of length 1
>>>       start end width
>>> [1]     2   2     1
>>>
>>>   y=IRanges(c(1,3), c(1,3))
>>>> y
>>>>
>>> IRanges of length 2
>>>       start end width
>>> [1]     1   1     1
>>> [2]     3   3     1
>>>
>>>   precede(x, y)
>>>>
>>> [1] 2
>>>
>>>> follow(x, y)
>>>>
>>> [1] 1
>>>
>>>   X = GRanges(ranges=x, seqnames="chr1")
>>>> Y = GRanges(ranges=y, seqnames="chr1")
>>>> X
>>>>
>>> GRanges with 1 range and 0 metadata columns:
>>>         seqnames    ranges strand
>>>            <Rle> <IRanges>  <Rle>
>>>     [1]     chr1    [2, 2]      *
>>>     ---
>>>     seqlengths:
>>>      chr1
>>>        NA
>>>
>>>> Y
>>>>
>>> GRanges with 2 ranges and 0 metadata columns:
>>>         seqnames    ranges strand
>>>            <Rle> <IRanges>  <Rle>
>>>     [1]     chr1    [1, 1]      *
>>>     [2]     chr1    [3, 3]      *
>>>     ---
>>>     seqlengths:
>>>      chr1
>>>        NA
>>>
>>>   precede(X, Y)
>>>>
>>> [1] 1
>>>
>>>> follow(X, Y)
>>>>
>>> [1] 1
>>>
>>>   precede(X, Y, ignore.strand=TRUE)
>>>>
>>> [1] 2
>>>
>>>> follow(X, Y, ignore.strand=TRUE)
>>>>
>>> [1] 1
>>>
>>>
>>> On Mar 3, 2013, at 4:10 PM, Hermann Norpois wrote:
>>>
>>>   Hello,
>>>>
>>>> I have to grange-objects - gr1 and gr2. And I wish to know:
>>>> 1) What are the neigbouring intervals of gr1 in gr2 - upstream and
>>>> downstream?
>>>> 2) What is the distance in bp between the neighbouring intervals?
>>>>
>>>> Could anybody give me a hint how this is done?
>>>> Thanks
>>>> Hermann
>>>>
>>>>   dput (gr1)
>>>>>
>>>> new("GRanges"
>>>>      , seqnames = new("Rle"
>>>>      , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>> "chr12",
>>>> "chr13",
>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>      , lengths = 4L
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , ranges = new("IRanges"
>>>>      , start = c(540823L, 752809L, 771015L, 773361L)
>>>>      , width = c(221L, 230L, 520L, 247L)
>>>>      , NAMES = NULL
>>>>      , elementType = "integer"
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , strand = new("Rle"
>>>>      , values = structure(3L, .Label = c("+", "-", "*"), class = "factor")
>>>>      , lengths = 4L
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , elementMetadata = new("DataFrame"
>>>>      , rownames = NULL
>>>>      , nrows = 4L
>>>>      , listData = structure(list(edensity = c(589L, 574L, 578L, 571L),
>>>> epeak
>>>> = c(111L,
>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>> "epeak", "over"))
>>>>      , elementType = "ANY"
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , seqinfo = new("Seqinfo"
>>>>      , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>>> "chr15",
>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>> "chrX", "chrY")
>>>>      , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>      , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> NA, NA,
>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>      , genome = c(NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
>>>> )
>>>> )
>>>>      , metadata = list()
>>>> )
>>>>
>>>>> dput (gr1)
>>>>>
>>>> new("GRanges"
>>>>      , seqnames = new("Rle"
>>>>      , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>> "chr12",
>>>> "chr13",
>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>      , lengths = 4L
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , ranges = new("IRanges"
>>>>      , start = c(540823L, 752809L, 771015L, 773361L)
>>>>      , width = c(221L, 230L, 520L, 247L)
>>>>      , NAMES = NULL
>>>>      , elementType = "integer"
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , strand = new("Rle"
>>>>      , values = structure(3L, .Label = c("+", "-", "*"), class = "factor")
>>>>      , lengths = 4L
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , elementMetadata = new("DataFrame"
>>>>      , rownames = NULL
>>>>      , nrows = 4L
>>>>      , listData = structure(list(edensity = c(589L, 574L, 578L, 571L),
>>>> epeak
>>>> = c(111L,
>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>> "epeak", "over"))
>>>>      , elementType = "ANY"
>>>>      , elementMetadata = NULL
>>>>      , metadata = list()
>>>> )
>>>>      , seqinfo = new("Seqinfo"
>>>>      , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>>> "chr15",
>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>> "chrX", "chrY")
>>>>      , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>      , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> NA, NA,
>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>      , genome = c(NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
>>>> )
>>>> )
>>>>      , metadata = list()
>>>> )
>>>>
>>>>   gr1
>>>>>
>>>> GRanges with 4 ranges and 3 metadata columns:
>>>>        seqnames           ranges strand |  edensity     epeak      over
>>>>           <Rle>        <IRanges>  <Rle> | <integer> <integer> <integer>
>>>>    [1]     chr1 [540823, 541043]      * |       589       111         0
>>>>    [2]     chr1 [752809, 753038]      * |       574        89         1
>>>>    [3]     chr1 [771015, 771534]      * |       578       234         1
>>>>    [4]     chr1 [773361, 773607]      * |       571       136         1
>>>>    ---
>>>>    seqlengths:
>>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX
>>>> chrY
>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA
>>>> NA
>>>>
>>>>> gr2
>>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>>        seqnames           ranges strand |  edensity     epeak      over
>>>>           <Rle>        <IRanges>  <Rle> | <integer> <integer> <integer>
>>>>    [1]     chr1 [ 53141,  53685]      * |       601       212         0
>>>>    [2]     chr1 [521536, 521622]      * |       568        37         1
>>>>    [3]     chr1 [805002, 805650]      * |      1000       326         1
>>>>    ---
>>>>    seqlengths:
>>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX
>>>> chrY
>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA
>>>> NA
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319