[Bioc-devel] findOverlaps and mclapply
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Fri Jul 6 17:51:14 CEST 2012
This is about the findOverlaps method for (query = "GenomicRanges",
subject = "GenomicRanges") in GenomicRanges.
This method is very slow when the number of distinct seqnames is
large. Example:
library(BSgenome.Amellifera.BeeBase.assembly4)
Un <- Amellifera$GroupUn
gr <- GRanges(seqnames = names(Un),
              ranges = IRanges(start = 1, width = width(Un)))
length(gr)
## Only 9244 in length
system.time(findOverlaps(gr[1], gr))
   user  system elapsed
297.202   0.021 297.279
That is pretty slow for finding overlaps between a GRanges of length 1
and a GRanges of length roughly 10000.
This is because the function essentially does an lapply over distinct
seqnames. I raised this issue a while ago, and Michael said that he
might consider building an IntervalTree over both seqnames and ranges.
So this is really only an issue in organisms with many small contigs.
However, in the meantime, I would appreciate making the function
mclapply-aware, which is pretty simple and makes the function faster
by a factor of roughly the number of cores.
My fix (which I have been using for a while) adds the arguments
mc.cores = 1, mc.preschedule = TRUE
to the specific method, and uses
matchMatrix <- do.call(rbind, mclapply(commonSeqnames,
    function(seqnm) {
        <SNIP>
    }, mc.cores = mc.cores,
    mc.preschedule = mc.preschedule))
inside the method definition. The package also needs to depend on
parallel.
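
To illustrate the pattern described above, here is a minimal,
self-contained sketch. It is not the GenomicRanges code: the helper
overlapsBySeqname and the data.frame layout (columns seqnames, start,
end) are made up for illustration, and a naive outer() comparison
stands in for the snipped per-seqname overlap computation. Only the
mclapply / do.call(rbind, ...) structure mirrors the fix.

```r
library(parallel)

# Hypothetical helper, for illustration only (not part of GenomicRanges).
# query and subject are data.frames with columns seqnames, start, end.
# Returns a two-column matrix of (query, subject) row indices.
overlapsBySeqname <- function(query, subject,
                              mc.cores = 1, mc.preschedule = TRUE) {
    # Row indices grouped by seqname
    qIdx <- split(seq_len(nrow(query)), as.character(query$seqnames))
    sIdx <- split(seq_len(nrow(subject)), as.character(subject$seqnames))
    commonSeqnames <- intersect(names(qIdx), names(sIdx))

    # One task per shared seqname; mclapply spreads them over mc.cores
    hits <- mclapply(commonSeqnames, function(seqnm) {
        q <- qIdx[[seqnm]]
        s <- sIdx[[seqnm]]
        # Closed intervals overlap when each starts no later than the
        # other ends (naive stand-in for the real overlap code)
        ov <- outer(query$start[q], subject$end[s], "<=") &
              outer(query$end[q], subject$start[s], ">=")
        w <- which(ov, arr.ind = TRUE)
        cbind(query = q[w[, 1]], subject = s[w[, 2]])
    }, mc.cores = mc.cores, mc.preschedule = mc.preschedule)

    # Stack the per-seqname hit matrices into one match matrix
    do.call(rbind, hits)
}

# Toy data: one query range on "a"; subjects on "a" and "b"
query <- data.frame(seqnames = "a", start = 1, end = 10)
subject <- data.frame(seqnames = c("a", "a", "b"),
                      start = c(5, 20, 1),
                      end = c(15, 30, 10))
matchMatrix <- overlapsBySeqname(query, subject)
```

With the default mc.cores = 1 this behaves like a plain lapply, so
existing callers would see no change unless they ask for more cores.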
I am raising this issue because I am submitting a package containing
this fix, and I felt it might be good to propagate. I can commit it
to subversion as well, if there is any interest.
Best,
Kasper