[BioC] finding neigbouring intervals for 2 GRange objects
Valerie Obenchain
vobencha at fhcrc.org
Tue Mar 5 05:30:52 CET 2013
Hi Herve,
On 03/04/13 15:50, Hervé Pagès wrote:
> Hi,
>
> FWIW I also find it confusing that (d) gives something different:
>
> subject <- GRanges("chr1", IRanges(1:3, width=1))
> x <- subject[2]
>
> (a): + / +
>
> strand(x) <- strand(subject) <- "+"
>
> > precede(x, subject)
> [1] 3
> > follow(x, subject)
> [1] 1
>
>
> (b): + / *
>
> strand(x) <- "+"
> strand(subject) <- "*"
>
> > precede(x, subject)
> [1] 3
> > follow(x, subject)
> [1] 1
>
> (c): * / +
>
> strand(x) <- "*"
> strand(subject) <- "+"
>
> > precede(x, subject)
> [1] 3
> > follow(x, subject)
> [1] 1
>
> (d): * / *
>
> strand(x) <- strand(subject) <- "*"
>
> > precede(x, subject)
> [1] 1
> > follow(x, subject)
> [1] 1
>
> What could be the motivation behind this? Wouldn't things be more
> consistent and less confusing if (a), (b), (c), and (d) all behaved
> consistently with (e):
This is the "maybe we tried to be too clever" part. (d) is consistent
with (e) when ignore.strand=TRUE. By default ignore.strand=FALSE and
these methods look for solutions on both/either strand.
Val
>
> > precede(ranges(x), ranges(subject))
> [1] 3
> > follow(ranges(x), ranges(subject))
> [1] 1
>
> Thanks,
> H.
>
>
> On 03/04/2013 09:57 AM, Michael Lawrence wrote:
>> On Mon, Mar 4, 2013 at 9:17 AM, Valerie Obenchain
>> <vobencha at fhcrc.org>wrote:
>>
>>> Hi Hermann, Harris,
>>>
>>> I wanted to give some background on the behavior of these functions for
>>> GRanges vs IRanges.
>>>
>>>> gr <- GRanges("chr1", IRanges(c(1,2,3), width=1))
>>>> gr
>>>
>>> GRanges with 3 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [1, 1] *
>>> [2] chr1 [2, 2] *
>>> [3] chr1 [3, 3] *
>>>
>>> -----------------------------
>>> "+" strand behavior:
>>> -----------------------------
>>> IRanges objects have no strand so you can expect them to behave as "+"
>>> strand GRanges do. If this is ever not the case please let me know.
>>> That
>>> would be a bug.
>>>
>>>> strand(gr) <- "+"
>>>> precede(gr)
>>> [1] 2 3 NA
>>>> precede(ranges(gr))
>>> [1] 2 3 NA
>>>
>>> -----------------------------
>>> "-" strand behavior:
>>> -----------------------------
>>> I think it makes intuitive sense that precede() and follow() on "-"
>>> strand
>>> GRanges will give the opposite results from precede() and follow()
>>> on "+"
>>> strand GRanges.
>>>
>>>> strand(gr) <- "-"
>>>> precede(gr)
>>> [1] NA 1 2
>>>
>>> -----------------------------
>>> "*" strand behavior:
>>> -----------------------------
>>> For "*" strand GRanges maybe we tried to be too clever. When
>>> ignore.strand=TRUE strand is treated as "+":
>>>> strand(gr) <- "*"
>>>> precede(gr, ignore.strand=TRUE)
>>> [1] 2 3 NA
>>>
>>> When ignore.strand=FALSE for a "*" strand range, results are
>>> computed for
>>> both "+' and "-". If one result is NA, the other is taken. This is
>>> how we
>>> end up with a '2' in positions 1 and 3 (position 3 was NA for "+" and
>>> position 1 was NA for "-"). If there is a tie, the first in order is
>>> taken.
>>> This is how we get the '1' in position 2 (tie was between '1' and
>>> '3' and
>>> '1' comes before '3' in order).
>>>> precede(gr, ignore.strand=FALSE)
>>> [1] 2 1 2
>>>
>>> It was a challenge to decide how "*" strand with ignore.strand=FALSE
>>> should behave. If others have opinions on this I'd be interested in
>>> hearing
>>> them.
>>>
>>>
>>
>> So it seems that "*" in GenomicRanges can mean at least two things:
>> "either" or "missing". I think at one point these were represented by
>> different values, but now the meanings have been merged for simplicity.
>> Instead, we need to use an extra parameter, in this case
>> ignore.strand, to
>> distinguish the two meanings during an operation. It's often
>> convenient to
>> be able to change the meaning on a per-operation basis, but there is
>> also a
>> desire to keep the meaning in the data structure. Anyway, all this
>> has been
>> decided long ago and it probably does not make sense to change
>> anything now.
>>
>> There is some concern though about the usability of the API. Usually,
>> a "*"
>> is used to select the natural/IRanges behavior, but the default of
>> ignore.strand is FALSE, so the "either" meaning is assumed. This is
>> going
>> to surprise users. The same is true of ignore.strand in findOverlaps:
>> many
>> users are surprised to find that by default 'strand' separates the
>> ranges
>> into two coordinate systems, instead of just indicating direction.
>>
>> Michael
>>
>>
>>> Valerie
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 03/03/13 16:38, Harris A. Jaffee wrote:
>>>
>>>> You want precede() and follow(), and then distance(), but tread
>>>> somewhat
>>>> carefully with strands of "*":
>>>>
>>>> x=IRanges(2, 2)
>>>>> x
>>>>>
>>>> IRanges of length 1
>>>> start end width
>>>> [1] 2 2 1
>>>>
>>>> y=IRanges(c(1,3), c(1,3))
>>>>> y
>>>>>
>>>> IRanges of length 2
>>>> start end width
>>>> [1] 1 1 1
>>>> [2] 3 3 1
>>>>
>>>> precede(x, y)
>>>>>
>>>> [1] 2
>>>>
>>>>> follow(x, y)
>>>>>
>>>> [1] 1
>>>>
>>>> X = GRanges(ranges=x, seqnames="chr1")
>>>>> Y = GRanges(ranges=y, seqnames="chr1")
>>>>> X
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] chr1 [2, 2] *
>>>> ---
>>>> seqlengths:
>>>> chr1
>>>> NA
>>>>
>>>>> Y
>>>>>
>>>> GRanges with 2 ranges and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] chr1 [1, 1] *
>>>> [2] chr1 [3, 3] *
>>>> ---
>>>> seqlengths:
>>>> chr1
>>>> NA
>>>>
>>>> precede(X, Y)
>>>>>
>>>> [1] 1
>>>>
>>>>> follow(X, Y)
>>>>>
>>>> [1] 1
>>>>
>>>> precede(X, Y, ignore.strand=TRUE)
>>>>>
>>>> [1] 2
>>>>
>>>>> follow(X, Y, ignore.strand=TRUE)
>>>>>
>>>> [1] 1
>>>>
>>>>
>>>> On Mar 3, 2013, at 4:10 PM, Hermann Norpois wrote:
>>>>
>>>> Hello,
>>>>>
>>>>> I have to grange-objects - gr1 and gr2. And I wish to know:
>>>>> 1) What are the neigbouring intervals of gr1 in gr2 - upstream and
>>>>> downstream?
>>>>> 2) What is the distance in bp between the neighbouring intervals?
>>>>>
>>>>> Could anybody give me a hint how this is done?
>>>>> Thanks
>>>>> Hermann
>>>>>
>>>>> dput (gr1)
>>>>>>
>>>>> new("GRanges"
>>>>> , seqnames = new("Rle"
>>>>> , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>>> "chr12",
>>>>> "chr13",
>>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>> , lengths = 4L
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , ranges = new("IRanges"
>>>>> , start = c(540823L, 752809L, 771015L, 773361L)
>>>>> , width = c(221L, 230L, 520L, 247L)
>>>>> , NAMES = NULL
>>>>> , elementType = "integer"
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , strand = new("Rle"
>>>>> , values = structure(3L, .Label = c("+", "-", "*"), class =
>>>>> "factor")
>>>>> , lengths = 4L
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , elementMetadata = new("DataFrame"
>>>>> , rownames = NULL
>>>>> , nrows = 4L
>>>>> , listData = structure(list(edensity = c(589L, 574L, 578L,
>>>>> 571L),
>>>>> epeak
>>>>> = c(111L,
>>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>>> "epeak", "over"))
>>>>> , elementType = "ANY"
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , seqinfo = new("Seqinfo"
>>>>> , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13",
>>>>> "chr14",
>>>>> "chr15",
>>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>>> "chrX", "chrY")
>>>>> , seqlengths = c(NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>> , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>> NA, NA,
>>>>> NA, NA,
>>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>> , genome = c(NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_
>>>>> )
>>>>> )
>>>>> , metadata = list()
>>>>> )
>>>>>
>>>>>> dput (gr1)
>>>>>>
>>>>> new("GRanges"
>>>>> , seqnames = new("Rle"
>>>>> , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>>> "chr12",
>>>>> "chr13",
>>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>>> , lengths = 4L
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , ranges = new("IRanges"
>>>>> , start = c(540823L, 752809L, 771015L, 773361L)
>>>>> , width = c(221L, 230L, 520L, 247L)
>>>>> , NAMES = NULL
>>>>> , elementType = "integer"
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , strand = new("Rle"
>>>>> , values = structure(3L, .Label = c("+", "-", "*"), class =
>>>>> "factor")
>>>>> , lengths = 4L
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , elementMetadata = new("DataFrame"
>>>>> , rownames = NULL
>>>>> , nrows = 4L
>>>>> , listData = structure(list(edensity = c(589L, 574L, 578L,
>>>>> 571L),
>>>>> epeak
>>>>> = c(111L,
>>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>>> "epeak", "over"))
>>>>> , elementType = "ANY"
>>>>> , elementMetadata = NULL
>>>>> , metadata = list()
>>>>> )
>>>>> , seqinfo = new("Seqinfo"
>>>>> , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13",
>>>>> "chr14",
>>>>> "chr15",
>>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>>> "chrX", "chrY")
>>>>> , seqlengths = c(NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>>> , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>> NA, NA,
>>>>> NA, NA,
>>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>>> , genome = c(NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_,
>>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>>> NA_character_
>>>>> )
>>>>> )
>>>>> , metadata = list()
>>>>> )
>>>>>
>>>>> gr1
>>>>>>
>>>>> GRanges with 4 ranges and 3 metadata columns:
>>>>> seqnames ranges strand | edensity epeak over
>>>>> <Rle> <IRanges> <Rle> | <integer> <integer>
>>>>> <integer>
>>>>> [1] chr1 [540823, 541043] * | 589 111 0
>>>>> [2] chr1 [752809, 753038] * | 574 89 1
>>>>> [3] chr1 [771015, 771534] * | 578 234 1
>>>>> [4] chr1 [773361, 773607] * | 571 136 1
>>>>> ---
>>>>> seqlengths:
>>>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8
>>>>> chr9 chrX
>>>>> chrY
>>>>> NA NA NA NA NA NA ... NA NA NA
>>>>> NA NA
>>>>> NA
>>>>>
>>>>>> gr2
>>>>>>
>>>>> GRanges with 3 ranges and 3 metadata columns:
>>>>> seqnames ranges strand | edensity epeak over
>>>>> <Rle> <IRanges> <Rle> | <integer> <integer>
>>>>> <integer>
>>>>> [1] chr1 [ 53141, 53685] * | 601 212 0
>>>>> [2] chr1 [521536, 521622] * | 568 37 1
>>>>> [3] chr1 [805002, 805650] * | 1000 326 1
>>>>> ---
>>>>> seqlengths:
>>>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8
>>>>> chr9 chrX
>>>>> chrY
>>>>> NA NA NA NA NA NA ... NA NA NA
>>>>> NA NA
>>>>> NA
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>>
>>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>
>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list