[BioC] finding neigbouring intervals for 2 GRange objects

Tue Mar 5 05:30:52 CET 2013

Hi Herve,

On 03/04/13 15:50, Hervé Pagès wrote:
> Hi,
>
> FWIW I also find it confusing that (d) gives something different:
>
>   subject <- GRanges("chr1", IRanges(1:3, width=1))
>   x <- subject[2]
>
> (a): + / +
>
>   strand(x) <- strand(subject) <- "+"
>
>   > precede(x, subject)
>   [1] 3
>   > follow(x, subject)
>   [1] 1
>
>
> (b): + / *
>
>   strand(x) <- "+"
>   strand(subject) <- "*"
>
>   > precede(x, subject)
>   [1] 3
>   > follow(x, subject)
>   [1] 1
>
> (c): * / +
>
>   strand(x) <- "*"
>   strand(subject) <- "+"
>
>   > precede(x, subject)
>   [1] 3
>   > follow(x, subject)
>   [1] 1
>
> (d): * / *
>
>   strand(x) <- strand(subject) <- "*"
>
>   > precede(x, subject)
>   [1] 1
>   > follow(x, subject)
>   [1] 1
>
> What could be the motivation behind this? Wouldn't things be more
> consistent and less confusing if (a), (b), (c), and (d) all behaved
> consistently with (e):

This is the "maybe we tried to be too clever" part. (d) is consistent 
with (e) when ignore.strand=TRUE. By default ignore.strand=FALSE and 
these methods look for solutions on both/either strand.

Val

>
>   > precede(ranges(x), ranges(subject))
>   [1] 3
>   > follow(ranges(x), ranges(subject))
>   [1] 1
>
> Thanks,
> H.
>
>
> On 03/04/2013 09:57 AM, Michael Lawrence wrote:
>> On Mon, Mar 4, 2013 at 9:17 AM, Valerie Obenchain 
>> <vobencha at fhcrc.org>wrote:
>>
>>> Hi Hermann, Harris,
>>>
>>> I wanted to give some background on the behavior of these functions for
>>> GRanges vs IRanges.
>>>
>>>> gr <- GRanges("chr1", IRanges(c(1,2,3), width=1))
>>>> gr
>>>
>>> GRanges with 3 ranges and 0 metadata columns:
>>>        seqnames    ranges strand
>>>           <Rle> <IRanges>  <Rle>
>>>    [1]     chr1    [1, 1]      *
>>>    [2]     chr1    [2, 2]      *
>>>    [3]     chr1    [3, 3]      *
>>>
>>> -----------------------------
>>> "+" strand behavior:
>>> -----------------------------
>>> IRanges objects have no strand so you can expect them to behave as "+"
>>> strand GRanges do. If this is ever not the case please let me know. 
>>> That
>>> would be a bug.
>>>
>>>> strand(gr) <- "+"
>>>> precede(gr)
>>> [1]  2  3 NA
>>>> precede(ranges(gr))
>>> [1]  2  3 NA
>>>
>>> -----------------------------
>>> "-" strand behavior:
>>> -----------------------------
>>> I think it makes intuitive sense that precede() and follow() on "-" 
>>> strand
>>> GRanges will give the opposite results from precede() and follow() 
>>> on "+"
>>> strand GRanges.
>>>
>>>> strand(gr) <- "-"
>>>> precede(gr)
>>> [1] NA  1  2
>>>
>>> -----------------------------
>>> "*" strand behavior:
>>> -----------------------------
>>> For "*" strand GRanges maybe we tried to be too clever.  When
>>> ignore.strand=TRUE strand is treated as "+":
>>>> strand(gr) <- "*"
>>>> precede(gr, ignore.strand=TRUE)
>>> [1]  2  3 NA
>>>
>>> When ignore.strand=FALSE for a "*" strand range, results are 
>>> computed for
>>> both "+' and "-". If one result is NA, the other is taken. This is 
>>> how we
>>> end up with a '2' in positions 1 and 3 (position 3 was NA for "+" and
>>> position 1 was NA for "-"). If there is a tie, the first in order is 
>>> taken.
>>> This is how we get the '1' in position 2 (tie was between '1' and 
>>> '3' and
>>> '1' comes before '3' in order).
>>>> precede(gr, ignore.strand=FALSE)
>>> [1] 2 1 2
>>>
>>> It was a challenge to decide how "*" strand with ignore.strand=FALSE
>>> should behave. If others have opinions on this I'd be interested in 
>>> hearing
>>> them.
>>>
>>>
>>
>> So it seems that "*" in GenomicRanges can mean at least two things:
>> "either" or "missing". I think at one point these were represented by
>> different values, but now the meanings have been merged for simplicity.
>> Instead, we need to use an extra parameter, in this case 
>> ignore.strand, to
>> distinguish the two meanings during an operation. It's often 
>> convenient to
>> be able to change the meaning on a per-operation basis, but there is 
>> also a
>> desire to keep the meaning in the data structure. Anyway, all this 
>> has been
>> decided long ago and it probably does not make sense to change 
>> anything now.
>>
>> There is some concern though about the usability of the API. Usually, 
>> a "*"
>> is used to select the natural/IRanges behavior, but the default of
>> ignore.strand is FALSE, so the "either" meaning is assumed. This is 
>> going
>> to surprise users. The same is true of ignore.strand in findOverlaps: 
>> many
>> users are surprised to find that by default 'strand' separates the 
>> ranges
>> into two coordinate systems, instead of just indicating direction.
>>
>> Michael
>>
>>
>>> Valerie
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 03/03/13 16:38, Harris A. Jaffee wrote:
>>>
>>>> You want precede() and follow(), and then distance(), but tread 
>>>> somewhat
>>>> carefully with strands of "*":
>>>>
>>>>   x=IRanges(2, 2)
>>>>> x
>>>>>
>>>> IRanges of length 1
>>>>       start end width
>>>> [1]     2   2     1
>>>>
>>>>   y=IRanges(c(1,3), c(1,3))
>>>>> y
>>>>>
>>>> IRanges of length 2
>>>>       start end width
>>>> [1]     1   1     1
>>>> [2]     3   3     1
>>>>
>>>>   precede(x, y)
>>>>>
>>>> [1] 2
>>>>
>>>>> follow(x, y)
>>>>>
>>>> [1] 1
>>>>
>>>>   X = GRanges(ranges=x, seqnames="chr1")
>>>>> Y = GRanges(ranges=y, seqnames="chr1")
>>>>> X
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>>         seqnames    ranges strand
>>>>            <Rle> <IRanges>  <Rle>
>>>>     [1]     chr1    [2, 2]      *
>>>>     ---
>>>>     seqlengths:
>>>>      chr1
>>>>        NA
>>>>
>>>>> Y
>>>>>
>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>         seqnames    ranges strand
>>>>            <Rle> <IRanges>  <Rle>
>>>>     [1]     chr1    [1, 1]      *
>>>>     [2]     chr1    [3, 3]      *
>>>>     ---
>>>>     seqlengths:
>>>>      chr1
>>>>        NA
>>>>
>>>>   precede(X, Y)
>>>>>
>>>> [1] 1
>>>>
>>>>> follow(X, Y)
>>>>>
>>>> [1] 1
>>>>
>>>>   precede(X, Y, ignore.strand=TRUE)
>>>>>
>>>> [1] 2
>>>>
>>>>> follow(X, Y, ignore.strand=TRUE)
>>>>>
>>>> [1] 1
>>>>
>>>>
>>>> On Mar 3, 2013, at 4:10 PM, Hermann Norpois wrote:
>>>>
>>>>   Hello,
>>>>>
>>>>> I have to grange-objects - gr1 and gr2. And I wish to know:
>>>>> 1) What are the neigbouring intervals of gr1 in gr2 - upstream and
>>>>> downstream?
>>>>> 2) What is the distance in bp between the neighbouring intervals?
>>>>>
>>>>> Could anybody give me a hint how this is done?
>>>>> Thanks
>>>>> Hermann
>>>>>
>>>>>   dput (gr1)
>>>>>>
>>>>> new("GRanges"
>>>>>      , seqnames = new("Rle"
>>>>>      , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>>> "chr12",
>>>>> "chr13",
>>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>>      , lengths = 4L
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , ranges = new("IRanges"
>>>>>      , start = c(540823L, 752809L, 771015L, 773361L)
>>>>>      , width = c(221L, 230L, 520L, 247L)
>>>>>      , NAMES = NULL
>>>>>      , elementType = "integer"
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , strand = new("Rle"
>>>>>      , values = structure(3L, .Label = c("+", "-", "*"), class = 
>>>>> "factor")
>>>>>      , lengths = 4L
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , elementMetadata = new("DataFrame"
>>>>>      , rownames = NULL
>>>>>      , nrows = 4L
>>>>>      , listData = structure(list(edensity = c(589L, 574L, 578L, 
>>>>> 571L),
>>>>> epeak
>>>>> = c(111L,
>>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>>> "epeak", "over"))
>>>>>      , elementType = "ANY"
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , seqinfo = new("Seqinfo"
>>>>>      , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", 
>>>>> "chr14",
>>>>> "chr15",
>>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>>> "chrX", "chrY")
>>>>>      , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, 
>>>>> NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>>      , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
>>>>> NA, NA,
>>>>> NA, NA,
>>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>>      , genome = c(NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_, 
>>>>> NA_character_
>>>>> )
>>>>> )
>>>>>      , metadata = list()
>>>>> )
>>>>>
>>>>>> dput (gr1)
>>>>>>
>>>>> new("GRanges"
>>>>>      , seqnames = new("Rle"
>>>>>      , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>>> "chr12",
>>>>> "chr13",
>>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>>      , lengths = 4L
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , ranges = new("IRanges"
>>>>>      , start = c(540823L, 752809L, 771015L, 773361L)
>>>>>      , width = c(221L, 230L, 520L, 247L)
>>>>>      , NAMES = NULL
>>>>>      , elementType = "integer"
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , strand = new("Rle"
>>>>>      , values = structure(3L, .Label = c("+", "-", "*"), class = 
>>>>> "factor")
>>>>>      , lengths = 4L
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , elementMetadata = new("DataFrame"
>>>>>      , rownames = NULL
>>>>>      , nrows = 4L
>>>>>      , listData = structure(list(edensity = c(589L, 574L, 578L, 
>>>>> 571L),
>>>>> epeak
>>>>> = c(111L,
>>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>>> "epeak", "over"))
>>>>>      , elementType = "ANY"
>>>>>      , elementMetadata = NULL
>>>>>      , metadata = list()
>>>>> )
>>>>>      , seqinfo = new("Seqinfo"
>>>>>      , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", 
>>>>> "chr14",
>>>>> "chr15",
>>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>>> "chrX", "chrY")
>>>>>      , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, 
>>>>> NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>>      , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
>>>>> NA, NA,
>>>>> NA, NA,
>>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>>      , genome = c(NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_, 
>>>>> NA_character_
>>>>> )
>>>>> )
>>>>>      , metadata = list()
>>>>> )
>>>>>
>>>>>   gr1
>>>>>>
>>>>> GRanges with 4 ranges and 3 metadata columns:
>>>>>        seqnames           ranges strand |  edensity epeak      over
>>>>>           <Rle>        <IRanges>  <Rle> | <integer> <integer> 
>>>>> <integer>
>>>>>    [1]     chr1 [540823, 541043]      * |       589 111         0
>>>>>    [2]     chr1 [752809, 753038]      * |       574 89         1
>>>>>    [3]     chr1 [771015, 771534]      * |       578 234         1
>>>>>    [4]     chr1 [773361, 773607]      * |       571 136         1
>>>>>    ---
>>>>>    seqlengths:
>>>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7 chr8  
>>>>> chr9  chrX
>>>>> chrY
>>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA NA    
>>>>> NA    NA
>>>>> NA
>>>>>
>>>>>> gr2
>>>>>>
>>>>> GRanges with 3 ranges and 3 metadata columns:
>>>>>        seqnames           ranges strand |  edensity epeak      over
>>>>>           <Rle>        <IRanges>  <Rle> | <integer> <integer> 
>>>>> <integer>
>>>>>    [1]     chr1 [ 53141,  53685]      * |       601 212         0
>>>>>    [2]     chr1 [521536, 521622]      * |       568 37         1
>>>>>    [3]     chr1 [805002, 805650]      * |      1000 326         1
>>>>>    ---
>>>>>    seqlengths:
>>>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7 chr8  
>>>>> chr9  chrX
>>>>> chrY
>>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA NA    
>>>>> NA    NA
>>>>> NA
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor> 
>>>>>
>>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor> 
>>>>>
>>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor> 
>>>>
>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor> 
>>>>
>>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor> 
>>>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor> 
>>>
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>