[BioC] Semantics of GenomicRanges gaps()

Alex Gutteridge alexg at ruggedtextile.com
Fri May 31 10:46:45 CEST 2013


I just have a quick question/comment about the behaviour of the gaps 
function in the GenomicRanges package, particularly with how it handles 
* strand ranges. The current behaviour is as below:

> range = GRanges(seqnames="1",IRanges(start=300,end=700), strand="*", 
> seqinfo=Seqinfo("1",seqlengths=1000))
> range
GRanges with 1 range and 0 metadata columns:
       seqnames     ranges strand
          <Rle>  <IRanges>  <Rle>
   [1]        1 [300, 700]      *
   ---
   seqlengths:
       1
    1000
> gaps(range)
GRanges with 4 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
   [1]        1 [  1, 1000]      +
   [2]        1 [  1, 1000]      -
   [3]        1 [  1,  299]      *
   [4]        1 [701, 1000]      *
   ---
   seqlengths:
       1
    1000

My interpretation of "*" as a strand identifier is that it means the 
range covers both + and - strands and so intuitively I was expecting the 
'gaps' to only cover the 3rd and 4th ranges returned above (not the 
full-length + and - strand ranges). The semantics here implies to me 
that the * strand is being thought of as a kind of imaginary third 
strand and hence doesn't overlap with the + or - strands. This is 
contrary to the semantics of findOverlaps which does appear to consider 
a * strand range to overlap with + or - strand ranges:

> findOverlaps(range,gaps(range))
Hits of length 2
queryLength: 1
subjectLength: 4
   queryHits subjectHits
    <integer>   <integer>
  1         1           1
  2         1           2

To summarise I guess I was expecting findOverlaps(range,gaps(range)) to 
never return any overlaps under any circumstance (that I can think of!). 
Do people agree the behaviour of gaps() is not quite intuitive in this 
case?

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0

loaded via a namespace (and not attached):
[1] parallel_2.15.2 stats4_2.15.2

-- 
Alex Gutteridge



More information about the Bioconductor mailing list