[BioC] Semantics of GenomicRanges gaps()

Dan Du tooyoung at gmail.com
Wed Aug 21 09:13:15 CEST 2013


Hi all,

Sorry to wake up this sleeping beauty.

Just something more to add and consider, there is another inconsistency
in the gaps function for GRanges, when there is no seqinfo present. The
default gaps will not produce extra ranges for those unused strands,
unless one specify the "end" argument, however only setting "start"
won't trigger the extra results. This is the current behavior of both
release and dev branch,

## rls
GenomicRanges_1.13.36 IRanges_1.19.24 
## dev
GenomicRanges_1.12.4 IRanges_1.18.3 

## case without seqinfo
g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
seqinfo=Seqinfo("1",seqlengths=1000))
g
GRanges with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1   [3, 10]      *
  ---
  seqlengths:
    1
   NA

## default, no extra
gaps(g) # this is expected, start always defaults to 1L
GRanges with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1    [1, 2]      *
  ---
  seqlengths:
    1
   NA

## with start, no extra
gaps(g, start=2) 
GRanges with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1    [2, 2]      *
  ---
  seqlengths:
    1
   NA
## with end, strand comes into play, giving two more
gaps(g, end=20)
GRanges with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1  [ 1, 20]      +
  [2]        1  [ 1, 20]      -
  [3]        1  [ 1,  2]      *
  [4]        1  [11, 20]      *
  ---
  seqlengths:
    1
   NA

## with both start and end, same as above
gaps(g, start=2, end=20) #
GRanges with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1  [ 2, 20]      +
  [2]        1  [ 2, 20]      -
  [3]        1  [ 2,  2]      *
  [4]        1  [11, 20]      *
  ---
  seqlengths:
    1
## reset seqinfo
seqinfo(g)<-Seqinfo("1",seqlengths=1000)

## now the extra ranges are back
gaps(g, start=2)
GRanges with 4 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [ 2, 1000]      +
  [2]        1 [ 2, 1000]      -
  [3]        1 [ 2,    2]      *
  [4]        1 [11, 1000]      *
  ---
  seqlengths:
      1
   1000

Best,
Dan

On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
> Not sure how confusing this would be, but a common case is (like Alex's
> input) a set of ranges where the strand is uniform, i.e., all ranges have
> strand 'x'. In that case, gaps,GenomicRanges could behave just like
> gaps,Ranges and return only ranges on strand 'x'. That would satisfy both
> properties.
> 
> A natural extension would be that if all ranges are '+' or '-' (none of
> them are '*'), then do not add the '*' range spanning the whole sequence.
> Also satisfies both properties.
> 
> I only know of one use-case where all three strand types are mixed:
> rna-seq, where the strand has been inferred only from the spliced
> alignments. Not if gaps() would even be useful in that case, but if it
> were, I think the current gaps behavior would make sense.
> 
> Michael
> 
> 
> 
> 
> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at fhcrc.org> wrote:
> 
> > Hi Alex,
> >
> >
> > On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
> >
> >> I just have a quick question/comment about the behaviour of the gaps
> >> function in the GenomicRanges package, particularly with how it handles
> >> * strand ranges. The current behaviour is as below:
> >>
> >>  range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*",
> >>> seqinfo=Seqinfo("1",**seqlengths=1000))
> >>> range
> >>>
> >> GRanges with 1 range and 0 metadata columns:
> >>        seqnames     ranges strand
> >>           <Rle>  <IRanges>  <Rle>
> >>    [1]        1 [300, 700]      *
> >>    ---
> >>    seqlengths:
> >>        1
> >>     1000
> >>
> >>> gaps(range)
> >>>
> >> GRanges with 4 ranges and 0 metadata columns:
> >>        seqnames      ranges strand
> >>           <Rle>   <IRanges>  <Rle>
> >>    [1]        1 [  1, 1000]      +
> >>    [2]        1 [  1, 1000]      -
> >>    [3]        1 [  1,  299]      *
> >>    [4]        1 [701, 1000]      *
> >>    ---
> >>    seqlengths:
> >>        1
> >>     1000
> >>
> >> My interpretation of "*" as a strand identifier is that it means the
> >> range covers both + and - strands and so intuitively I was expecting the
> >> 'gaps' to only cover the 3rd and 4th ranges returned above (not the
> >> full-length + and - strand ranges).
> >>
> >
> > It's even worse than that. If there is no range at all on a chromosome,
> > gaps(gr) will return 3 ranges covering the full chromosome:
> >
> >   > gr <- GRanges("chr1",
> >                   IRanges(start=300,end=700),
> >                   seqlengths=c(chr1=1000,chr2=**1000))
> >
> >   > gr
> >
> >   GRanges with 1 range and 0 metadata columns:
> >         seqnames     ranges strand
> >            <Rle>  <IRanges>  <Rle>
> >     [1]     chr1 [300, 700]      *
> >     ---
> >     seqlengths:
> >      chr1  chr2
> >      1000 1000
> >
> >   > gaps(gr)
> >
> >   GRanges with 7 ranges and 0 metadata columns:
> >         seqnames      ranges strand
> >            <Rle>   <IRanges>  <Rle>
> >     [1]     chr1 [  1, 1000]      +
> >     [2]     chr1 [  1, 1000]      -
> >     [3]     chr1 [  1,  299]      *
> >     [4]     chr1 [701, 1000]      *
> >     [5]     chr2 [  1, 1000]      +
> >     [6]     chr2 [  1, 1000]      -
> >     [7]     chr2 [  1, 1000]      *
> >     ---
> >     seqlengths:
> >      chr1  chr2
> >      1000 1000
> >
> >
> >  The semantics here implies to me
> >> that the * strand is being thought of as a kind of imaginary third
> >> strand and hence doesn't overlap with the + or - strands. This is
> >> contrary to the semantics of findOverlaps which does appear to consider
> >> a * strand range to overlap with + or - strand ranges:
> >>
> >>  findOverlaps(range,gaps(range)**)
> >>>
> >> Hits of length 2
> >> queryLength: 1
> >> subjectLength: 4
> >>    queryHits subjectHits
> >>     <integer>   <integer>
> >>   1         1           1
> >>   2         1           2
> >>
> >> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to
> >> never return any overlaps under any circumstance (that I can think of!).
> >>
> >
> > Let's call this property 2. This sounds indeed like a good property
> > that it would be nice to have. But an even more important property is
> > property 1: gaps() must be its own reverse operation i.e.
> > 'gaps(gaps(gr))' must always return 'gr'.
> >
> > The current behavior of gaps() guarantees property 1. I'm not against
> > changing gaps() behavior to also have property 2, as long as property
> > 1 is preserved. Suggestions are welcome.
> >
> > Thanks,
> > H.
> >
> >
> >  Do people agree the behaviour of gaps() is not quite intuitive in this
> >> case?
> >>
> >>  sessionInfo()
> >>>
> >> R version 2.15.2 (2012-10-26)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>
> >> locale:
> >>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >>   [7] LC_PAPER=C                 LC_NAME=C
> >>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0
> >>
> >> loaded via a namespace (and not attached):
> >> [1] parallel_2.15.2 stats4_2.15.2
> >>
> >>
> > --
> > Herv Pags
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
> >
> > ______________________________**_________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
> > Search the archives: http://news.gmane.org/gmane.**
> > science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list