[Bioc-devel] findOverlaps and mclapply

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Jul 6 17:51:14 CEST 2012


This is about the findOverlaps method for (query = "GenomicRanges",
subject = "GenomicRanges") in GenomicRanges.

This method is very slow when the number of distinct seqnames is
large.  Example:

library(BSgenome.Amellifera.BeeBase.assembly4)
Un <- Amellifera$GroupUn
gr <- GRanges(seqnames = names(Un),
              ranges= IRanges(start = 1 , width = width(Un)))
length(gr)

## Only 9244 in length

system.time(findOverlaps(gr[1], gr))
  user  system elapsed
 297.202   0.021 297.279

Pretty slow for finding overlaps between a GRanges of length 1 and a
GRanges of length roughly 10000.

This is because the function essentially does an lapply over distinct
seqnames.  I raised this issue a while ago, and Michael said that he
might consider building an IntervalTree over both seqnames and ranges.
So this is really only an issue in organisms with many small contigs.
However, in the meantime, I would appreciate making the function
mclapply-aware, which is pretty simple and at least makes the function
roughly #cores times faster.

My fix (which I have been using for a while) adds the arguments
  mc.cores = 1, mc.preschedule = TRUE
to the specific method and uses mclapply for the per-seqname loop
inside the method definition:

        matchMatrix <- do.call(rbind, mclapply(commonSeqnames,
                                               function(seqnm) {
<SNIP>
                                               }, mc.cores = mc.cores,
                                               mc.preschedule = mc.preschedule))

The package also needs to depend on parallel.
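
For readers without the GenomicRanges source at hand, the pattern can
be sketched in a self-contained way.  This is only an illustration
under stated assumptions, not the actual method body (the <SNIP> part
above is not reproduced): overlapsBySeqname is a hypothetical helper,
ranges are plain data.frames with columns seqnames, start and end, and
the per-seqname work is a naive all-pairs comparison rather than an
interval tree.

```r
library(parallel)

## Hypothetical helper, NOT the GenomicRanges implementation.
## query and subject are data.frames with columns seqnames, start, end.
overlapsBySeqname <- function(query, subject,
                              mc.cores = 1, mc.preschedule = TRUE) {
    commonSeqnames <- intersect(as.character(query$seqnames),
                                as.character(subject$seqnames))
    ## One task per sequence name; with mc.cores = 1 this is a plain lapply.
    ## mc.cores > 1 relies on forking and is not available on Windows.
    perSeqname <- mclapply(commonSeqnames, function(seqnm) {
        qIdx <- which(query$seqnames == seqnm)
        sIdx <- which(subject$seqnames == seqnm)
        ## Naive stand-in for the real per-seqname search: test all pairs.
        pairs <- expand.grid(q = qIdx, s = sIdx)
        ## Closed intervals overlap when each starts before the other ends.
        keep <- query$start[pairs$q] <= subject$end[pairs$s] &
                subject$start[pairs$s] <= query$end[pairs$q]
        cbind(query = pairs$q[keep], subject = pairs$s[keep])
    }, mc.cores = mc.cores, mc.preschedule = mc.preschedule)
    matchMatrix <- do.call(rbind, perSeqname)
    if (is.null(matchMatrix)) {
        ## No sequence names in common: return an empty two-column matrix.
        return(cbind(query = integer(0), subject = integer(0)))
    }
    ## Sort by query index, then by subject index.
    matchMatrix[order(matchMatrix[, "query"], matchMatrix[, "subject"]), ,
                drop = FALSE]
}

## Made-up toy data with three contigs.
subject <- data.frame(seqnames = c("ctgA", "ctgA", "ctgB", "ctgC"),
                      start    = c(1, 50, 1, 1),
                      end      = c(10, 60, 100, 5))
query <- data.frame(seqnames = c("ctgA", "ctgB"),
                    start    = c(5, 200),
                    end      = c(55, 300))
hits <- overlapsBySeqname(query, subject)
```

Keeping mc.cores = 1 as the default means existing callers see the
same serial behaviour unless they opt in to more cores.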

I am raising this issue because I am submitting a package containing
this fix, and I felt it might be good to propagate.  I can commit it
to subversion as well, if there is any interest.

Best,
Kasper


