[BioC] GenomicRanges: nearest() for GRanges not returning overlaps

Tue Jul 3 16:09:57 CEST 2012

Hi,

I read this: https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-June/046287.html

However, after reading it I'm a little confused as to what the default
behaviour should be when using nearest() with strand '*', and also how to
get back to what I believe was the previous, IRanges-style behaviour.

Is nearest() on GRanges supposed to act like nearest() on IRanges, or is
it instead supposed to return the nearest neighbour OTHER than any
subject range that overlaps the query?

For example, with IRanges:

> query <- IRanges(c(1, 3, 9), c(2, 7, 10))
> subject <- IRanges(c(3, 5, 12), c(3, 6, 12))
> nearest(query, subject)
[1] 1 1 3

I ask because GRanges now seems to behave differently: it skips
overlapping ranges and returns only the nearest non-overlapping neighbour, e.g.

> query2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12), c(2, 7, 12)))
> subject2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12), c(3, 6, 12)))
> nearest(query2, subject2)
[1] 1 3 2
>
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US             LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] stats4_2.15.1
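
To make the two interpretations concrete, here is a minimal base-R sketch that reproduces both results for the query2/subject2 ranges above. It is not the GenomicRanges implementation: the helpers `gap` and `nearest_sketch` and the `skip_overlaps` argument are hypothetical names made up for this illustration, seqnames and strand are ignored, and ties are assumed to go to the first subject.

```r
# Hypothetical helpers for illustration only; not part of IRanges/GenomicRanges.

# Gap between closed integer intervals [s1, e1] and [s2, e2]:
# 0 when they overlap or are adjacent, otherwise the number of positions between them.
gap <- function(s1, e1, s2, e2) pmax(0, pmax(s1, s2) - pmin(e1, e2) - 1)

nearest_sketch <- function(qs, qe, ss, se, skip_overlaps = FALSE) {
  sapply(seq_along(qs), function(i) {
    d <- gap(qs[i], qe[i], ss, se)
    if (skip_overlaps) {
      # Treat subjects that overlap the query as ineligible.
      overlaps <- ss <= qe[i] & se >= qs[i]
      d[overlaps] <- Inf
    }
    # Ties go to the first subject (an assumption of this sketch).
    if (all(is.infinite(d))) NA_integer_ else which.min(d)
  })
}

qs <- c(1, 1, 12); qe <- c(2, 7, 12)   # ranges of query2
ss <- c(3, 5, 12); se <- c(3, 6, 12)   # ranges of subject2

old_style <- nearest_sketch(qs, qe, ss, se)                        # overlaps count as nearest
new_style <- nearest_sketch(qs, qe, ss, se, skip_overlaps = TRUE)  # overlaps are skipped

print(old_style)
print(new_style)
```

With overlaps allowed the sketch gives 1 1 3, matching the IRanges-style result; with overlaps skipped it gives 1 3 2, matching what GenomicRanges 1.8.7 returns here.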

What confuses me is that the previous behaviour (GenomicRanges 1.8.6) matched IRanges:

> query <- IRanges(c(1, 3, 9), c(2, 7, 10))
> subject <- IRanges(c(3, 5, 12), c(3, 6, 12))
> nearest(query, subject)
[1] 1 1 3
>
> query2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12), c(2, 7, 12)))
> subject2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12), c(3, 6, 12)))
> nearest(query2, subject2)
[1] 1 1 3
>
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.8.6 IRanges_1.14.3      BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] stats4_2.15.0

And in fact, this has always been the behaviour, right? For example, under R 2.13.1:

> query <- IRanges(c(1, 3, 9), c(2, 7, 10))
> subject <- IRanges(c(3, 5, 12), c(3, 6, 12))
> nearest(query, subject)
[1] 1 1 3
>
> query2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12), c(2, 7, 12)))
> subject2 <- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12), c(3, 6, 12)))
> nearest(query2, subject2)
[1] 1 1 3
>
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.4.8 IRanges_1.10.6

Is ignore.strand=TRUE intended to give the IRanges-like behaviour?
I ask because that does not work for me either:

> nearest(x = query2, subject = subject2, ignore.strand=TRUE)
Error in strand(x) <- strand(subject) <- "+" : object 'x' not found

Thanks!

Jim