[BioC] Semantics of GenomicRanges gaps()
Alex Gutteridge
alexg at ruggedtextile.com
Fri May 31 10:46:45 CEST 2013
I just have a quick question/comment about the behaviour of the gaps
function in the GenomicRanges package, particularly with how it handles
* strand ranges. The current behaviour is as below:
> range = GRanges(seqnames="1",IRanges(start=300,end=700), strand="*",
> seqinfo=Seqinfo("1",seqlengths=1000))
> range
GRanges with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [300, 700] *
---
seqlengths:
1
1000
> gaps(range)
GRanges with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 1, 1000] +
[2] 1 [ 1, 1000] -
[3] 1 [ 1, 299] *
[4] 1 [701, 1000] *
---
seqlengths:
1
1000
My interpretation of "*" as a strand identifier is that it means the
range covers both + and - strands and so intuitively I was expecting the
'gaps' to only cover the 3rd and 4th ranges returned above (not the
full-length + and - strand ranges). The semantics here implies to me
that the * strand is being thought of as a kind of imaginary third
strand and hence doesn't overlap with the + or - strands. This is
contrary to the semantics of findOverlaps which does appear to consider
a * strand range to overlap with + or - strand ranges:
> findOverlaps(range,gaps(range))
Hits of length 2
queryLength: 1
subjectLength: 4
queryHits subjectHits
<integer> <integer>
1 1 1
2 1 2
To summarise I guess I was expecting findOverlaps(range,gaps(range)) to
never return any overlaps under any circumstance (that I can think of!).
Do people agree the behaviour of gaps() is not quite intuitive in this
case?
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] parallel_2.15.2 stats4_2.15.2
--
Alex Gutteridge
More information about the Bioconductor
mailing list