[Bioc-devel] improvement to findOverlap for GRanges

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Sun Nov 27 02:48:42 CET 2011


On Sat, Nov 26, 2011 at 6:58 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> As an ultimate goal, it make sense to vectorize things over many seqnames.
> The IntervalTree could also be made to store data over multiple spaces (like
> seqname, strand combinations) and could then be cached within a
> GenomicRanges object. That way, the tree is only built once.

This would, of course, be awesome.

One other piece of info is that organisms with many contigs typically
have shorter contigs.  One could perhaps start by ignoring the
seqnames, do a findOverlaps on the underlying ranges and then use
seqnames afterwards to sort it out.  Not sure whether this is a good
idea.

Kasper

>
> Michael
>
> On Sat, Nov 26, 2011 at 1:55 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>>
>> In my experience, GRanges are very convenient and amazingly fast.
>>
>> I have recently started to work on an organism with a more unfinished
>> genome.  Instead of the situation for model organisms with high
>> quality reference genomes where we have few, very long, chromosomes,
>> the organism I am working with has many, much smaller, contigs.  By
>> many I mean on the order of 10,000.  Given how hard it is to currently
>> finish a genome to high quality, we are likely to see many organisms
>> with such fragmented reference genomes.
>>
>> findOverlaps is quite slow in such situations.  Essentially
>> findOverlaps splits the GRanges by the seqnames and then does a lapply
>> over the (unique) seqnames (with a findOverlaps).  Example:
>>
>> library(GenomicRanges)
>>
>> simul <- function(nSamples = 10^5, nChr = 1000) {
>>    start <- sample(1:10^5, size = nSamples, replace = TRUE)
>>    chr <- paste("chr", sample(1:nChr, size = nSamples, replace =
>> TRUE), sep = "")
>>    gr <- GRanges(seqnames = chr, ranges = IRanges(start = start, width =
>> 1000))
>>    gr
>> }
>>
>> set.seed(1234)
>> gr1 <- simul()
>> gr2 <- simul()
>> gr2.1 <- gr2[1]
>> seqlevels(gr2.1) <- as.character(seqnames(gr2.1))
>> system.time({
>>    findOverlaps(gr1, gr2)
>> })
>> system.time({
>>    findOverlaps(gr1, gr2[1])
>> })
>> system.time({
>>    findOverlaps(gr1, gr2.1)
>> })
>> source("findOverlaps-methods.R")
>>
>> On my laptop the three calls take roughly 19.7sec, 16.1sec and
>> 0.07sec.  The difference between the last two calls is unnecessary and
>> arises because findOverlaps ends up sapply over many empty seqnames.
>> A fix is
>>
>> Index: findOverlaps-methods.R
>> ===================================================================
>> --- findOverlaps-methods.R      (revision 60860)
>> +++ findOverlaps-methods.R      (working copy)
>> @@ -71,11 +71,11 @@
>>         } else {
>>             querySeqnames <- seqnames(query)
>>             querySplitRanges <- splitRanges(querySeqnames)
>> -            uniqueQuerySeqnames <- names(querySplitRanges)
>> +            uniqueQuerySeqnames <-
>> names(querySplitRanges)[sapply(querySplitRanges, length) > 0]
>>
>>             subjectSeqnames <- seqnames(subject)
>>             subjectSplitRanges <- splitRanges(subjectSeqnames)
>> -            uniqueSubjectSeqnames <- names(subjectSplitRanges)
>> +            uniqueSubjectSeqnames <-
>> names(subjectSplitRanges)[sapply(subjectSplitRanges, length) > 0]
>>
>>             commonSeqnames <-
>>               intersect(uniqueQuerySeqnames, uniqueSubjectSeqnames)
>>
>> I am guessing that a fix like this might also make sense for coverage.
>>
>> However, this does not solve the "slowness" of findOverlaps for
>> GRanges like gr1 and gr2 above (many seqlevels, but also long).
>> Perhaps there is some obvious way to deal with this?  It would be nice
>> if GRanges are also fast with many seqlevels.
>>
>> Kasper
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>



More information about the Bioc-devel mailing list