[BioC] shift function (GenomicRanges, IRanges), issue with negative start values and circular DNA
Hervé Pagès
hpages at fhcrc.org
Mon Jan 23 20:51:04 CET 2012
Hi Robert,
On 01/23/2012 08:13 AM, Ivanek, Robert wrote:
> Hi Hervé,
>
> On 01/23/2012 01:04 AM, Hervé Pagès wrote:
>> Hi Robert,
>>
>> Sorry that this completely went off my radar but this week someone
>> kindly sent me a reminder.
>>
>> So after re-reading your email I agree that shift() should behave
>> consistently on both ends of the chromosome but I don't like the idea
>> that it automatically trims the ranges for you. I think we should
>> shift without trimming, but with a warning that some ranges are going
>> off limits.
>
> This sounds reasonable. User can decide what to do with them.
>
>
>> Then it would be very easy for the user to trim those ranges if
>> s/he wanted, using trim() (there is currently no "trim" method for
>> GRanges objects but I'll add one). Note that for ranges that are
>> completely off limits (e.g. start=-20, end=-4), it's not clear
>> what trim() should do: (a) fail, (b) produce a 0-width range located
>> on the edge (e.g. start=1, end=0), with or (c) without warning.
>
> I think, that the GRanges cannot have negative widths, so I would
> suggest that they would be trimmed to (start=1,end=1) or
> (start=seqlength, end=seqlength) by default.
The width of a range must be >= 0. The width is always equal to
end - start + 1 so for a 0-width range the end is equal to start - 1.
This is the only situation where the end < start. Sounds weird but
this way of representing 0-width ranges is just following the general
rule in the IRanges/GenomicRanges/GenomicFeatures/ShortRead world
that integer ranges are presented to the user as closed intervals.
So completely off limit ranges could be trimmed to
(start=1,end=0) or (start=seqlength+1, end=seqlength).
This is actually what restrict() does when used with
keep.all.ranges=TRUE:
> x <- GRanges("chr1", IRanges(c(-20, 3, 12), c(-4, 15, 15)))
> restrict(x, start=1, end=10, keep.all.ranges=TRUE)
GRanges with 3 ranges and 0 elementMetadata values:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 1, 0] *
[2] chr1 [ 3, 10] *
[3] chr1 [11, 10] *
---
seqlengths:
chr1
NA
>
>> This is one of the reasons I think this task is better done separately.
>> The other reason to not do the trimming in shift() is to maintain
>> the nice property that shift(shift(x, N), -N) should always be a no-op.
>>
>> When the chromosome is circular, I think we should do the same, except
>> that no warning should be issued when a range goes off limits.
>
> Does the GRanges class allow the negative width on circular chromosomes?
No, the width is always >= 0, even for ranges defined on a circular
chromosome. Allowing negative widths would be like opening a can of
worms.
Thanks for your feedback,
H.
>
> Cheers
>
> Robert
>
>
>>
>> Does that sound reasonable?
>>
>> Cheers,
>> H.
>>
>>
>> On 09/01/2011 08:23 AM, Ivanek, Robert wrote:
>>> Hi Michael,
>>>
>>> is there any plan to fixed these issues?
>>>
>>> Best Regards
>>>
>>> Robert
>>>
>>> On 09/01/2011 03:37 PM, Michael Lawrence wrote:
>>>> Nevermind, I misunderstood what you were trying to do.
>>>>
>>>> On Thu, Sep 1, 2011 at 6:36 AM, Michael Lawrence <michafla at gene.com
>>>> <mailto:michafla at gene.com>> wrote:
>>>>
>>>> As a sort of side-comment, you might be better off doing this with
>>>> resize(), i.e.,
>>>>
>>>> resize(gr, width(gr) - 50, fix = "end")
>>>>
>>>> Michael
>>>>
>>>> On Thu, Sep 1, 2011 at 5:27 AM, Ivanek, Robert <robert.ivanek at fmi.ch
>>>> <mailto:robert.ivanek at fmi.ch>> wrote:
>>>>
>>>> Dear Bioc developers,
>>>>
>>>> I am analysing the ChIP-Seq data using the GenomicRanges
>>>> package. I am usually shifting the reads to the middle of the
>>>> fragment (i.e. by 50 bp toward 3' end of the fragment according
>>>> the strand of the read), using the shift function (from
>>>> IRanges), which works in most cases.
>>>>
>>>> shift(gr, shift=50 * as.integer(ifelse(strand(gr)==__"-", -1, 1)))
>>>>
>>>> However sometimes the result does not look like as I would expect:
>>>>
>>>> 1. If the read is on the minus strand and the start value is
>>>> smaller than the shift value, it produces an error:
>>>>
>>>> ## Error in validObject(x) :
>>>> ## invalid class “IRanges” object: 'widths(x)' cannot contain
>>>> negative values
>>>>
>>>> If the read is on the plus strand at the end of the chromosome
>>>> it returns the size of the chromosome as a new end value. Would
>>>> be possible to apply similar strategy for the beginning of the
>>>> chromosome (It would return 1 as a start value)?
>>>>
>>>>
>>>>
>>>> 2. The shift function does not take into account information
>>>> about circularity of the chromosome. Is there any plan to
>>>> include it?
>>>>
>>>>
>>>> Please see below reproducible examples
>>>>
>>>> Best Regards
>>>>
>>>> Robert
>>>>
>>>> --
>>>> Robert Ivanek
>>>> Postdoctoral Fellow Schuebeler Group
>>>> Friedrich Miescher Institute
>>>> Maulbeerstrasse 66
>>>> 4058 Basel / Switzerland
>>>> Office phone: +41 61 697 6100 <tel:%2B41%2061%20697%206100>
>>>>
>>>>
>>>>
>>>>
>>>> EXAMPLES:
>>>>
>>>>
>>>> ##
>>>> ##############################__##############################__#################
>>>>
>>>>
>>>>
>>>> ##
>>>> library("GenomicRanges")
>>>> ## Loading required package: IRanges
>>>> ## Attaching package: ‘IRanges’
>>>> ## The following object(s) are masked from ‘package:base’:
>>>> ## cbind, eval, intersect, Map, mapply, order, paste, pmax,
>>>> pmax.int <http://pmax.int>, pmin, pmin.int <http://pmin.int>,
>>>> rbind, rep.int <http://rep.int>, setdiff, table, union
>>>>
>>>> ##
>>>> ##
>>>> ##############################__##############################__#################
>>>>
>>>>
>>>>
>>>> ##
>>>> chr1.gr <http://chr1.gr> <- GRanges(seqnames=Rle("chr1", 4),
>>>> ranges=IRanges(start=rep(c(1,__197195432),each=2), width=1),
>>>> strand=rep(c("+","-"),2),
>>>> seqlengths=c(chr1=197195432))
>>>> ##
>>>>
>>>> isCircular(chr1.gr <http://chr1.gr>) <- FALSE
>>>> ##
>>>>
>>>> chr1.gr <http://chr1.gr>
>>>> ## GRanges with 4 ranges and 0 elementMetadata values:
>>>> ## seqnames ranges strand
>>>> ## <Rle> <IRanges> <Rle>
>>>> ## [1] chr1 [ 1, 1] +
>>>> ## [2] chr1 [ 1, 1] -
>>>> ## [3] chr1 [197195432, 197195432] +
>>>> ## [4] chr1 [197195432, 197195432] -
>>>> ## ---
>>>> ## seqlengths:
>>>> ## chr1
>>>> ## 197195432
>>>> ##
>>>>
>>>> seqinfo(chr1.gr <http://chr1.gr>)
>>>> ## Seqinfo of length 1
>>>> ## seqnames seqlengths isCircular
>>>> ## chr1 197195432 FALSE
>>>> ##
>>>>
>>>> shift(chr1.gr <http://chr1.gr>, shift=50)
>>>> ## GRanges with 4 ranges and 0 elementMetadata values:
>>>> ## seqnames ranges strand
>>>> ## <Rle> <IRanges> <Rle>
>>>> ## [1] chr1 [ 51, 51] +
>>>> ## [2] chr1 [ 51, 51] -
>>>> ## [3] chr1 [197195432, 197195432] +
>>>> ## [4] chr1 [197195432, 197195432] -
>>>> ## ---
>>>> ## seqlengths:
>>>> ## chr1
>>>> ## 197195432
>>>> ## Warning message:
>>>> ## In `end<-`(`*tmp*`, check = FALSE, value = c(51L, 51L,
>>>> 197195482L, :
>>>> ## trimmed end values to be <= seqlengths
>>>> ##
>>>>
>>>> shift(chr1.gr <http://chr1.gr>, shift=50 *
>>>> as.integer(ifelse(strand(chr1.__gr <http://chr1.gr>)=="-", -1, 1)))
>>>> ## Error in validObject(x) :
>>>> ## invalid class “IRanges” object: 'widths(x)' cannot contain
>>>> negative values
>>>> ## In addition: Warning messages:
>>>> ## 1: In `end<-`(`*tmp*`, check = FALSE, value = c(51L, -49L,
>>>> 197195482L, :
>>>> ## trimmed end values to be <= seqlengths
>>>> ## 2: In `start<-`(`*tmp*`, value = c(51L, -49L, 197195432L,
>>>> 197195382L :
>>>> ## trimmed start values to be positive
>>>> ##
>>>> ##
>>>> ##############################__##############################__#################
>>>>
>>>>
>>>>
>>>> ##
>>>> chrM.gr <- GRanges(seqnames=Rle("chrM", 4),
>>>> ranges=IRanges(start=rep(c(1,__16299),each=2), width=1),
>>>> strand=rep(c("+","-"),2),
>>>> seqlengths=c(chrM=16299))
>>>> ##
>>>>
>>>> isCircular(chrM.gr) <- TRUE
>>>> ##
>>>>
>>>> chrM.gr
>>>> ## GRanges with 4 ranges and 0 elementMetadata values:
>>>> ## seqnames ranges strand
>>>> ## <Rle> <IRanges> <Rle>
>>>> ## [1] chrM [ 1, 1] +
>>>> ## [2] chrM [ 1, 1] -
>>>> ## [3] chrM [16299, 16299] +
>>>> ## [4] chrM [16299, 16299] -
>>>> ## ---
>>>> ## seqlengths:
>>>> ## chrM
>>>> ## 16299
>>>> ##
>>>>
>>>> seqinfo(chrM.gr)
>>>> ## Seqinfo of length 1
>>>> ## seqnames seqlengths isCircular
>>>> ## chrM 16299 TRUE
>>>> ##
>>>>
>>>> shift(chrM.gr, shift=50)
>>>> ## GRanges with 4 ranges and 0 elementMetadata values:
>>>> ## seqnames ranges strand
>>>> ## <Rle> <IRanges> <Rle>
>>>> ## [1] chrM [ 51, 51] +
>>>> ## [2] chrM [ 51, 51] -
>>>> ## [3] chrM [16299, 16299] +
>>>> ## [4] chrM [16299, 16299] -
>>>> ## ---
>>>> ## seqlengths:
>>>> ## chrM
>>>> ## 16299
>>>> ## Warning message:
>>>> ## In `end<-`(`*tmp*`, check = FALSE, value = c(51L, 51L,
>>>> 16349L, 16349L :
>>>> ## trimmed end values to be <= seqlengths
>>>> ##
>>>>
>>>> shift(chrM.gr, shift=50 *
>>>> as.integer(ifelse(strand(chrM.__gr)=="-", -1, 1)))
>>>> ## Error in validObject(x) :
>>>> ## invalid class “IRanges” object: 'widths(x)' cannot contain
>>>> negative values
>>>> ## In addition: Warning messages:
>>>> ## 1: In `end<-`(`*tmp*`, check = FALSE, value = c(51L, -49L,
>>>> 16349L, :
>>>> ## trimmed end values to be <= seqlengths
>>>> ## 2: In `start<-`(`*tmp*`, value = c(51L, -49L, 16299L, 16249L)) :
>>>> ## trimmed start values to be positive
>>>> ##
>>>> ##
>>>> ##
>>>> ##############################__##############################__#################
>>>>
>>>>
>>>>
>>>> ##
>>>> sessionInfo()
>>>> ## R Under development (unstable) (2011-08-29 r56824)
>>>> ## Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> ## locale:
>>>> ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
>>>> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
>>>> ## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>>> LC_TELEPHONE=C
>>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> ## attached base packages:
>>>> ## [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> ## other attached packages:
>>>> ## [1] GenomicRanges_1.5.31 IRanges_1.11.26
>>>> ##
>>>> ##
>>>> ##############################__##############################__#################
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>
>>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list