[BioC] finding neigbouring intervals for 2 GRange objects
Hervé Pagès
hpages at fhcrc.org
Tue Mar 5 00:50:13 CET 2013
Hi,
FWIW I also find it confusing that (d) gives something different:
subject <- GRanges("chr1", IRanges(1:3, width=1))
x <- subject[2]
(a): + / +
strand(x) <- strand(subject) <- "+"
> precede(x, subject)
[1] 3
> follow(x, subject)
[1] 1
(b): + / *
strand(x) <- "+"
strand(subject) <- "*"
> precede(x, subject)
[1] 3
> follow(x, subject)
[1] 1
(c): * / +
strand(x) <- "*"
strand(subject) <- "+"
> precede(x, subject)
[1] 3
> follow(x, subject)
[1] 1
(d): * / *
strand(x) <- strand(subject) <- "*"
> precede(x, subject)
[1] 1
> follow(x, subject)
[1] 1
What could be the motivation behind this? Wouldn't things be more
consistent and less confusing if (a), (b), (c), and (d) all behaved
consistently with (e):
> precede(ranges(x), ranges(subject))
[1] 3
> follow(ranges(x), ranges(subject))
[1] 1
Thanks,
H.
On 03/04/2013 09:57 AM, Michael Lawrence wrote:
> On Mon, Mar 4, 2013 at 9:17 AM, Valerie Obenchain <vobencha at fhcrc.org>wrote:
>
>> Hi Hermann, Harris,
>>
>> I wanted to give some background on the behavior of these functions for
>> GRanges vs IRanges.
>>
>>> gr <- GRanges("chr1", IRanges(c(1,2,3), width=1))
>>> gr
>>
>> GRanges with 3 ranges and 0 metadata columns:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] chr1 [1, 1] *
>> [2] chr1 [2, 2] *
>> [3] chr1 [3, 3] *
>>
>> -----------------------------
>> "+" strand behavior:
>> -----------------------------
>> IRanges objects have no strand so you can expect them to behave as "+"
>> strand GRanges do. If this is ever not the case please let me know. That
>> would be a bug.
>>
>>> strand(gr) <- "+"
>>> precede(gr)
>> [1] 2 3 NA
>>> precede(ranges(gr))
>> [1] 2 3 NA
>>
>> -----------------------------
>> "-" strand behavior:
>> -----------------------------
>> I think it makes intuitive sense that precede() and follow() on "-" strand
>> GRanges will give the opposite results from precede() and follow() on "+"
>> strand GRanges.
>>
>>> strand(gr) <- "-"
>>> precede(gr)
>> [1] NA 1 2
>>
>> -----------------------------
>> "*" strand behavior:
>> -----------------------------
>> For "*" strand GRanges maybe we tried to be too clever. When
>> ignore.strand=TRUE strand is treated as "+":
>>> strand(gr) <- "*"
>>> precede(gr, ignore.strand=TRUE)
>> [1] 2 3 NA
>>
>> When ignore.strand=FALSE for a "*" strand range, results are computed for
>> both "+' and "-". If one result is NA, the other is taken. This is how we
>> end up with a '2' in positions 1 and 3 (position 3 was NA for "+" and
>> position 1 was NA for "-"). If there is a tie, the first in order is taken.
>> This is how we get the '1' in position 2 (tie was between '1' and '3' and
>> '1' comes before '3' in order).
>>> precede(gr, ignore.strand=FALSE)
>> [1] 2 1 2
>>
>> It was a challenge to decide how "*" strand with ignore.strand=FALSE
>> should behave. If others have opinions on this I'd be interested in hearing
>> them.
>>
>>
>
> So it seems that "*" in GenomicRanges can mean at least two things:
> "either" or "missing". I think at one point these were represented by
> different values, but now the meanings have been merged for simplicity.
> Instead, we need to use an extra parameter, in this case ignore.strand, to
> distinguish the two meanings during an operation. It's often convenient to
> be able to change the meaning on a per-operation basis, but there is also a
> desire to keep the meaning in the data structure. Anyway, all this has been
> decided long ago and it probably does not make sense to change anything now.
>
> There is some concern though about the usability of the API. Usually, a "*"
> is used to select the natural/IRanges behavior, but the default of
> ignore.strand is FALSE, so the "either" meaning is assumed. This is going
> to surprise users. The same is true of ignore.strand in findOverlaps: many
> users are surprised to find that by default 'strand' separates the ranges
> into two coordinate systems, instead of just indicating direction.
>
> Michael
>
>
>> Valerie
>>
>>
>>
>>
>>
>>
>> On 03/03/13 16:38, Harris A. Jaffee wrote:
>>
>>> You want precede() and follow(), and then distance(), but tread somewhat
>>> carefully with strands of "*":
>>>
>>> x=IRanges(2, 2)
>>>> x
>>>>
>>> IRanges of length 1
>>> start end width
>>> [1] 2 2 1
>>>
>>> y=IRanges(c(1,3), c(1,3))
>>>> y
>>>>
>>> IRanges of length 2
>>> start end width
>>> [1] 1 1 1
>>> [2] 3 3 1
>>>
>>> precede(x, y)
>>>>
>>> [1] 2
>>>
>>>> follow(x, y)
>>>>
>>> [1] 1
>>>
>>> X = GRanges(ranges=x, seqnames="chr1")
>>>> Y = GRanges(ranges=y, seqnames="chr1")
>>>> X
>>>>
>>> GRanges with 1 range and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [2, 2] *
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>>> Y
>>>>
>>> GRanges with 2 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [1, 1] *
>>> [2] chr1 [3, 3] *
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>> precede(X, Y)
>>>>
>>> [1] 1
>>>
>>>> follow(X, Y)
>>>>
>>> [1] 1
>>>
>>> precede(X, Y, ignore.strand=TRUE)
>>>>
>>> [1] 2
>>>
>>>> follow(X, Y, ignore.strand=TRUE)
>>>>
>>> [1] 1
>>>
>>>
>>> On Mar 3, 2013, at 4:10 PM, Hermann Norpois wrote:
>>>
>>> Hello,
>>>>
>>>> I have to grange-objects - gr1 and gr2. And I wish to know:
>>>> 1) What are the neigbouring intervals of gr1 in gr2 - upstream and
>>>> downstream?
>>>> 2) What is the distance in bp between the neighbouring intervals?
>>>>
>>>> Could anybody give me a hint how this is done?
>>>> Thanks
>>>> Hermann
>>>>
>>>> dput (gr1)
>>>>>
>>>> new("GRanges"
>>>> , seqnames = new("Rle"
>>>> , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>> "chr12",
>>>> "chr13",
>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>> , lengths = 4L
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , ranges = new("IRanges"
>>>> , start = c(540823L, 752809L, 771015L, 773361L)
>>>> , width = c(221L, 230L, 520L, 247L)
>>>> , NAMES = NULL
>>>> , elementType = "integer"
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , strand = new("Rle"
>>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor")
>>>> , lengths = 4L
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , elementMetadata = new("DataFrame"
>>>> , rownames = NULL
>>>> , nrows = 4L
>>>> , listData = structure(list(edensity = c(589L, 574L, 578L, 571L),
>>>> epeak
>>>> = c(111L,
>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>> "epeak", "over"))
>>>> , elementType = "ANY"
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , seqinfo = new("Seqinfo"
>>>> , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>>> "chr15",
>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>> "chrX", "chrY")
>>>> , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>> , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> NA, NA,
>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>> , genome = c(NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
>>>> )
>>>> )
>>>> , metadata = list()
>>>> )
>>>>
>>>>> dput (gr1)
>>>>>
>>>> new("GRanges"
>>>> , seqnames = new("Rle"
>>>> , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>>> "chr12",
>>>> "chr13",
>>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>> , lengths = 4L
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , ranges = new("IRanges"
>>>> , start = c(540823L, 752809L, 771015L, 773361L)
>>>> , width = c(221L, 230L, 520L, 247L)
>>>> , NAMES = NULL
>>>> , elementType = "integer"
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , strand = new("Rle"
>>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor")
>>>> , lengths = 4L
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , elementMetadata = new("DataFrame"
>>>> , rownames = NULL
>>>> , nrows = 4L
>>>> , listData = structure(list(edensity = c(589L, 574L, 578L, 571L),
>>>> epeak
>>>> = c(111L,
>>>> 89L, 234L, 136L), over = c(0L, 1L, 1L, 1L)), .Names = c("edensity",
>>>> "epeak", "over"))
>>>> , elementType = "ANY"
>>>> , elementMetadata = NULL
>>>> , metadata = list()
>>>> )
>>>> , seqinfo = new("Seqinfo"
>>>> , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>>> "chr15",
>>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>>> "chrX", "chrY")
>>>> , seqlengths = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>> , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> NA, NA,
>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>> , genome = c(NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>>> NA_character_,
>>>> NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
>>>> )
>>>> )
>>>> , metadata = list()
>>>> )
>>>>
>>>> gr1
>>>>>
>>>> GRanges with 4 ranges and 3 metadata columns:
>>>> seqnames ranges strand | edensity epeak over
>>>> <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
>>>> [1] chr1 [540823, 541043] * | 589 111 0
>>>> [2] chr1 [752809, 753038] * | 574 89 1
>>>> [3] chr1 [771015, 771534] * | 578 234 1
>>>> [4] chr1 [773361, 773607] * | 571 136 1
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8 chr9 chrX
>>>> chrY
>>>> NA NA NA NA NA NA ... NA NA NA NA NA
>>>> NA
>>>>
>>>>> gr2
>>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>> seqnames ranges strand | edensity epeak over
>>>> <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
>>>> [1] chr1 [ 53141, 53685] * | 601 212 0
>>>> [2] chr1 [521536, 521622] * | 568 37 1
>>>> [3] chr1 [805002, 805650] * | 1000 326 1
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8 chr9 chrX
>>>> chrY
>>>> NA NA NA NA NA NA ... NA NA NA NA NA
>>>> NA
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives:http://news.gmane.**org/gmane.science.biology.**
>>> informatics.conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list