[BioC] Semantics of GenomicRanges gaps()
Valerie Obenchain
vobencha at fhcrc.org
Fri Aug 23 20:31:46 CEST 2013
Hi Dan,
Thanks for catching this. Herve is currently out of town. We'll put this
on the TODO and address it when he is back.
Valerie
On 08/21/2013 12:13 AM, Dan Du wrote:
> Hi all,
>
> Sorry to wake up this sleeping beauty.
>
> Just something more to add and consider, there is another inconsistency
> in the gaps function for GRanges, when there is no seqinfo present. The
> default gaps will not produce extra ranges for those unused strands,
> unless one specify the "end" argument, however only setting "start"
> won't trigger the extra results. This is the current behavior of both
> release and dev branch,
>
> ## rls
> GenomicRanges_1.13.36 IRanges_1.19.24
> ## dev
> GenomicRanges_1.12.4 IRanges_1.18.3
>
> ## case without seqinfo
> g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
> seqinfo=Seqinfo("1",seqlengths=1000))
> g
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [3, 10] *
> ---
> seqlengths:
> 1
> NA
>
> ## default, no extra
> gaps(g) # this is expected, start always defaults to 1L
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [1, 2] *
> ---
> seqlengths:
> 1
> NA
>
> ## with start, no extra
> gaps(g, start=2)
> GRanges with 1 range and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [2, 2] *
> ---
> seqlengths:
> 1
> NA
> ## with end, strand comes into play, giving two more
> gaps(g, end=20)
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [ 1, 20] +
> [2] 1 [ 1, 20] -
> [3] 1 [ 1, 2] *
> [4] 1 [11, 20] *
> ---
> seqlengths:
> 1
> NA
>
> ## with both start and end, same as above
> gaps(g, start=2, end=20) #
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [ 2, 20] +
> [2] 1 [ 2, 20] -
> [3] 1 [ 2, 2] *
> [4] 1 [11, 20] *
> ---
> seqlengths:
> 1
> ## reset seqinfo
> seqinfo(g)<-Seqinfo("1",seqlengths=1000)
>
> ## now the extra ranges are back
> gaps(g, start=2)
> GRanges with 4 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 1 [ 2, 1000] +
> [2] 1 [ 2, 1000] -
> [3] 1 [ 2, 2] *
> [4] 1 [11, 1000] *
> ---
> seqlengths:
> 1
> 1000
>
> Best,
> Dan
>
> On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
>> Not sure how confusing this would be, but a common case is (like Alex's
>> input) a set of ranges where the strand is uniform, i.e., all ranges have
>> strand 'x'. In that case, gaps,GenomicRanges could behave just like
>> gaps,Ranges and return only ranges on strand 'x'. That would satisfy both
>> properties.
>>
>> A natural extension would be that if all ranges are '+' or '-' (none of
>> them are '*'), then do not add the '*' range spanning the whole sequence.
>> Also satisfies both properties.
>>
>> I only know of one use-case where all three strand types are mixed:
>> rna-seq, where the strand has been inferred only from the spliced
>> alignments. Not if gaps() would even be useful in that case, but if it
>> were, I think the current gaps behavior would make sense.
>>
>> Michael
>>
>>
>>
>>
>> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at fhcrc.org> wrote:
>>
>>> Hi Alex,
>>>
>>>
>>> On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
>>>
>>>> I just have a quick question/comment about the behaviour of the gaps
>>>> function in the GenomicRanges package, particularly with how it handles
>>>> * strand ranges. The current behaviour is as below:
>>>>
>>>> range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*",
>>>>> seqinfo=Seqinfo("1",**seqlengths=1000))
>>>>> range
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] 1 [300, 700] *
>>>> ---
>>>> seqlengths:
>>>> 1
>>>> 1000
>>>>
>>>>> gaps(range)
>>>>>
>>>> GRanges with 4 ranges and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] 1 [ 1, 1000] +
>>>> [2] 1 [ 1, 1000] -
>>>> [3] 1 [ 1, 299] *
>>>> [4] 1 [701, 1000] *
>>>> ---
>>>> seqlengths:
>>>> 1
>>>> 1000
>>>>
>>>> My interpretation of "*" as a strand identifier is that it means the
>>>> range covers both + and - strands and so intuitively I was expecting the
>>>> 'gaps' to only cover the 3rd and 4th ranges returned above (not the
>>>> full-length + and - strand ranges).
>>>>
>>>
>>> It's even worse than that. If there is no range at all on a chromosome,
>>> gaps(gr) will return 3 ranges covering the full chromosome:
>>>
>>> > gr <- GRanges("chr1",
>>> IRanges(start=300,end=700),
>>> seqlengths=c(chr1=1000,chr2=**1000))
>>>
>>> > gr
>>>
>>> GRanges with 1 range and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [300, 700] *
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> 1000 1000
>>>
>>> > gaps(gr)
>>>
>>> GRanges with 7 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [ 1, 1000] +
>>> [2] chr1 [ 1, 1000] -
>>> [3] chr1 [ 1, 299] *
>>> [4] chr1 [701, 1000] *
>>> [5] chr2 [ 1, 1000] +
>>> [6] chr2 [ 1, 1000] -
>>> [7] chr2 [ 1, 1000] *
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> 1000 1000
>>>
>>>
>>> The semantics here implies to me
>>>> that the * strand is being thought of as a kind of imaginary third
>>>> strand and hence doesn't overlap with the + or - strands. This is
>>>> contrary to the semantics of findOverlaps which does appear to consider
>>>> a * strand range to overlap with + or - strand ranges:
>>>>
>>>> findOverlaps(range,gaps(range)**)
>>>>>
>>>> Hits of length 2
>>>> queryLength: 1
>>>> subjectLength: 4
>>>> queryHits subjectHits
>>>> <integer> <integer>
>>>> 1 1 1
>>>> 2 1 2
>>>>
>>>> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to
>>>> never return any overlaps under any circumstance (that I can think of!).
>>>>
>>>
>>> Let's call this property 2. This sounds indeed like a good property
>>> that it would be nice to have. But an even more important property is
>>> property 1: gaps() must be its own reverse operation i.e.
>>> 'gaps(gaps(gr))' must always return 'gr'.
>>>
>>> The current behavior of gaps() guarantees property 1. I'm not against
>>> changing gaps() behavior to also have property 2, as long as property
>>> 1 is preserved. Suggestions are welcome.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> Do people agree the behaviour of gaps() is not quite intuitive in this
>>>> case?
>>>>
>>>> sessionInfo()
>>>>>
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] parallel_2.15.2 stats4_2.15.2
>>>>
>>>>
>>> --
>>> Herv Pags
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list