[BioC] Exclude overlapping intervals
Cook, Malcolm
MEC at stowers.org
Fri Dec 14 20:24:32 CET 2012
Hermann, did you get the solution you needed out of this?
Herve, please check out my notes below. I think you will be amused/intersted....
.On 12/13/2012 12:00 PM, Cook, Malcolm wrote:
.> Hi, yall,
.>
.> I'm thinking Hermann is looking where the coverage of gr is 1, like this:
.>
.> library(functional)
.> lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges))
.> $chr1
.> NormalIRanges of length 5
.> start end width
.> [1] 4 10 7
.> [2] 12 19 8
.> [3] 45 54 10
.> [4] 81 100 20
.> [5] 105 200 96
.>
.>
.> ... which answers the question but the data is improperly shaped.
.>
.> We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz:
.>> grd<-disjoin(gr)
.>> grd[1==countOverlaps(grd,gr)]
.> GRanges with 5 ranges and 0 metadata columns:
.> seqnames ranges strand
.> <Rle> <IRanges> <Rle>
.> [1] chr1 [ 4, 10] *
.> [2] chr1 [ 12, 19] *
.> [3] chr1 [ 45, 54] *
.> [4] chr1 [ 81, 100] *
.> [5] chr1 [105, 200] *
.> ---
.> seqlengths:
.> chr1
.> NA
.>
.> So, which way is 'best' for performance?
.
.Note that the 2 solutions are not equivalent if the ranges are
.stranded: the former will ignore the strand whereas the latter
.will consider it. Even if the ranges are not stranded (like in
.your example), you still need to reduce() the result of the latter
.in order to get exactly the same set of ranges as with the former.
.
.2 alternatives to solution 1 with result properly shaped:
.
.** Sol 1b:
.
. irl <- IRangesList(lapply(coverage(gr) == 1L, as, "IRanges"))
. gr1b <- as(irl, "GRanges")
.
.** Sol 1c (using slice):
.
. gr1c <- as(slice(coverage(gr), lower=1L, upper=1L), "GRanges")
.
.Sol 1b seems to be the fastest.
.
Herve, thanks for these observations re: stranded and the need to reduce the results of my method 1.
In the most general case I agree, you are right, and the strand preserving method should be:
grd<-disjoin(gr)
reduce(grd[1==countOverlaps(grd,gr)])
Trying to appreciate and simplify your soln1, I rewrote it as a one-liner, thusly:
as(as(coverage(gr) == 1L,'IRangesList'),'GRanges')
which prompts me to abstract as follows:
> setAs('RleList','GRanges',
function(from) {
as(as(from,'IRangesList'),'GRanges')
})
allowing to express the problem most perspicuously like this:
> as(coverage(gr) == 1L,'GRanges')
GRanges with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 4, 10] *
[2] chr1 [ 12, 19] *
[3] chr1 [ 45, 54] *
[4] chr1 [ 81, 100] *
[5] chr1 [105, 200] *
---
seqlengths:
chr1
NA
>
Is this in the spirit? Is my SetAs making assumptions it should not?
Opinions anyone?
Oh! HEY! During this exploration, I discover to my great astonishment that `==` is not symetric at least as regards to name preservation !!!!!
> names(1==coverage(gr))
NULL
> names(coverage(gr)==1)
[1] "chr1"
Is there an explanation for this antisymetry? Is this to be expected? Should it be remedied? Is this another thread?
I am SURE that I am not the only one to have gotten bitten by this in the past. In fact, I think that I have run up in the past and have not realized that switch the order of the operands 'fixes' the problem.
As a result of this issue, for instance, my elegant (?) abstraction won't work when the comparison is reversed and will generate CRYPTIC error messages:
as(1L==coverage(gr),'GRanges')
Error in function (classes, fdef, mtable) (from #3) :
unable to find an inherited method for function "Rle", for signature "NULL", "missing"
Prompting me to change my setAs thusly:
setAs('RleList','GRanges',
function(from) {
if (is.null(names(from))) stop(
'\nHey, the parameter `from` is un-named, thus the segnames of any hoped-for `GRanges` are un-assignable!\n'
,'Possible workaround: reverse the order of a comparison which produced `from` (if any)!\n'
,'Example: re-write "1L==coverage(gr)" as "coverage(gr)==1\n'
,'Best: fix the up-stream bug that `==` is not symetric w.r.t. names'
)
as(as(from,'IRangesList'),'GRanges')
})
Or, or, or?
~Malcolm
.H.
.
.>
.> Best,
.>
.> ~Malcolm
.>
.>
.>> -----Original Message-----
.>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Hervé Pagès
.>> Sent: Thursday, December 13, 2012 12:03 PM
.>> To: Hermann Norpois
.>> Cc: Michael Lawrence; bioconductor at r-project.org
.>> Subject: Re: [BioC] Exclude overlapping intervals
.>>
.>> Hi Hermann,
.>>
.>> On 12/13/2012 09:33 AM, Hermann Norpois wrote:
.>>> As both methods did not work ...
.>>>
.>>> 1) > gr[!gr %in% excluded] gr[!gr %in% excluded]
.>>> Fehler in gr[!gr %in% excluded] :
.>>> Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl
.>>> für Funktion '[': Fehler in gr %in% excluded :
.>>> Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl
.>>> für Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden
.>>
.>> My limited knowledge of Goethe's tongue tells me that the 'excluded'
.>> object was not found. Of course you need to provide a GRanges object,
.>> call it 'excluded' or 'gr2', that contains the ranges that you want to
.>> remove from 'gr' (based on overlaps).
.>>
.>> Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that
.>> overlap with other ranges in 'gr' would not produce what you want
.>> though, because that would exclude everything:
.>>
.>> > gr[!gr %in% gr]
.>> GRanges with 0 ranges and 2 metadata columns:
.>> seqnames ranges strand | score GC
.>> <Rle> <IRanges> <Rle> | <integer> <numeric>
.>> ---
.>> seqlengths:
.>> chr1 chr2 chr3
.>> 1000 2000 1500
.>>
.>> Try this instead:
.>>
.>> gr[countOverlaps(gr, gr) <= 1L]
.>>
.>> Cheers,
.>> H.
.>>
.>>>
.>>>> setdiff (gr)
.>>> Fehler in as.vector(x) :
.>>> Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln
.>>>
.>>> ... I am posting my data:
.>>>
.>>>> gr
.>>> GRanges with 5 ranges and 0 metadata columns:
.>>> seqnames ranges strand
.>>> <Rle> <IRanges> <Rle>
.>>> [1] chr1 [ 4, 10] *
.>>> [2] chr1 [ 12, 19] *
.>>> [3] chr1 [ 45, 80] *
.>>> [4] chr1 [ 55, 100] *
.>>> [5] chr1 [105, 200] *
.>>> ---
.>>> seqlengths:
.>>> chr1
.>>> NA
.>>>> dput (gr)
.>>> new("GRanges"
.>>> , seqnames = new("Rle"
.>>> , values = structure(1L, .Label = "chr1", class = "factor")
.>>> , lengths = 5L
.>>> , elementMetadata = NULL
.>>> , metadata = list()
.>>> )
.>>> , ranges = new("IRanges"
.>>> , start = c(4L, 12L, 45L, 55L, 105L)
.>>> , width = c(7L, 8L, 36L, 46L, 96L)
.>>> , NAMES = NULL
.>>> , elementType = "integer"
.>>> , elementMetadata = NULL
.>>> , metadata = list()
.>>> )
.>>> , strand = new("Rle"
.>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor")
.>>> , lengths = 5L
.>>> , elementMetadata = NULL
.>>> , metadata = list()
.>>> )
.>>> , elementMetadata = new("DataFrame"
.>>> , rownames = NULL
.>>> , nrows = 5L
.>>> , listData = structure(list(), .Names = character(0))
.>>> , elementType = "ANY"
.>>> , elementMetadata = NULL
.>>> , metadata = list()
.>>> )
.>>> , seqinfo = new("Seqinfo"
.>>> , seqnames = "chr1"
.>>> , seqlengths = NA_integer_
.>>> , is_circular = NA
.>>> , genome = NA_character_
.>>> )
.>>> , metadata = list()
.>>> )
.>>>
.>>> Thanks
.>>> Hermann
.>>>
.>>> 2012/12/13 Michael Lawrence <lawrence.michael at gene.com>
.>>>
.>>>> Something like:
.>>>>
.>>>> gr[!gr %in% excluded]
.>>>>
.>>>> Another option is setdiff() but note that will generate a new set of
.>>>> sorted, non-overlapping, non-adjacent ("normal") ranges.
.>>>>
.>>>> Michael
.>>>>
.>>>>
.>>>> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at googlemail.com>wrote:
.>>>>
.>>>>> Hello,
.>>>>>
.>>>>> having a GRanges object I was looking for a function to exclude all
.>>>>> overlapping intervals. So I get exclusively intervals that do not overlap.
.>>>>> But I did not find a proper function. Has anybody an idea?
.>>>>> Thanks
.>>>>> Hermann
.>>>>>
.>>>>> [[alternative HTML version deleted]]
.>>>>>
.>>>>> _______________________________________________
.>>>>> Bioconductor mailing list
.>>>>> Bioconductor at r-project.org
.>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
.>>>>> Search the archives:
.>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
.>>>>>
.>>>>
.>>>>
.>>>
.>>> [[alternative HTML version deleted]]
.>>>
.>>>
.>>>
.>>> _______________________________________________
.>>> Bioconductor mailing list
.>>> Bioconductor at r-project.org
.>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
.>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
.>>>
.>>
.>> --
.>> Hervé Pagès
.>>
.>> Program in Computational Biology
.>> Division of Public Health Sciences
.>> Fred Hutchinson Cancer Research Center
.>> 1100 Fairview Ave. N, M1-B514
.>> P.O. Box 19024
.>> Seattle, WA 98109-1024
.>>
.>> E-mail: hpages at fhcrc.org
.>> Phone: (206) 667-5791
.>> Fax: (206) 667-1319
.>>
.>> _______________________________________________
.>> Bioconductor mailing list
.>> Bioconductor at r-project.org
.>> https://stat.ethz.ch/mailman/listinfo/bioconductor
.>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
.
.--
.Hervé Pagès
.
.Program in Computational Biology
.Division of Public Health Sciences
.Fred Hutchinson Cancer Research Center
.1100 Fairview Ave. N, M1-B514
.P.O. Box 19024
.Seattle, WA 98109-1024
.
.E-mail: hpages at fhcrc.org
.Phone: (206) 667-5791
.Fax: (206) 667-1319
More information about the Bioconductor
mailing list