[Bioc-devel] BiocParallel

Fri Nov 16 00:06:27 CET 2012

You can probably parallelize the findOverlaps function, but you'd have 
to write the code yourself, and most of that code would be bookkeeping 
to map the per-chunk hit indices back to the original objects. Maybe 
there's a case for adding a parallelized findOverlaps function to 
BiocParallel?
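
To give an idea of the bookkeeping involved, here is a rough sketch. It 
is only an illustration: parFindOverlapsIdx is a made-up helper name, it 
uses mclapply from the base "parallel" package over plain integer index 
chunks (so it does not depend on BiocParallel), and it returns a 
data.frame of index pairs rather than a proper Hits object. Note that 
mc.cores > 1 is not supported on Windows.

```r
library(parallel)
library(GenomicRanges)

# Hypothetical helper, not part of any package: split the query into
# contiguous chunks, find overlaps for each chunk in parallel, then map
# the chunk-local query indices back to positions in the full query.
parFindOverlapsIdx <- function(query, subject, n.chunks = 4L, mc.cores = 2L) {
    n <- length(query)
    if (n == 0L)
        return(data.frame(queryHits = integer(0), subjectHits = integer(0)))
    k <- min(n.chunks, n)
    # idx is a list of integer vectors, one per contiguous chunk
    idx <- split(seq_len(n), ceiling(seq_len(n) * k / n))
    hits <- mclapply(idx, function(i) {
        ol <- findOverlaps(query[i], subject)
        # queryHits(ol) indexes into query[i]; i[...] converts it back
        # to an index into the full query
        data.frame(queryHits = i[queryHits(ol)],
                   subjectHits = subjectHits(ol))
    }, mc.cores = mc.cores)
    do.call(rbind, unname(hits))
}
```

Only the query is chunked here; every worker sees the whole subject, so 
the subject indices need no adjustment.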

You can't parallelize the disjoin operation with something like 
"mclapply", since it is not a data-parallel operation: the result for 
one range depends on every other range that overlaps it. Maybe you could 
speed things up by writing a recursive version of disjoin which splits 
its argument into subsets, runs disjoin on each one, and then calls 
disjoin on the combined results. However, I'm not sure this would 
actually be faster in practice. More naively, if your arguments are 
GRanges, you can split by chromosome, run disjoin on each chromosome in 
parallel, and then merge the results. But the merged result may come out 
in a different order than a single disjoin call would give, in case you 
care.
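
A minimal sketch of the split-by-chromosome idea follows. The function 
name disjoinByChrom is made up for illustration, and it uses mclapply 
from the base "parallel" package on an ordinary list of GRanges. Ranges 
on different chromosomes never overlap, so disjoining each chromosome 
separately should not change which ranges come out; the final sort() is 
my addition, to put the merged result into a predictable order.

```r
library(parallel)
library(GenomicRanges)

# Hypothetical helper: disjoin each chromosome in a separate worker,
# then merge the per-chromosome results into one GRanges.
disjoinByChrom <- function(gr, mc.cores = 2L) {
    # one GRanges per seqlevel, converted to a plain list for mclapply
    by.chrom <- as.list(split(gr, seqnames(gr)))
    pieces <- mclapply(by.chrom, disjoin, mc.cores = mc.cores)
    merged <- unlist(GRangesList(pieces), use.names = FALSE)
    # sort by seqlevel, strand and position
    sort(merged)
}
```

Whether this beats a plain disjoin(gr) will depend on how large the 
per-chromosome pieces are compared with the cost of forking and copying 
results back, so it is worth timing on real data first.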

You can probably parallelize subsetByOverlaps using pvec, but you might 
have to ignore the warning about the output length not being the same as 
the input.
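
If the length warning is a concern, one alternative is to parallelize a 
length-preserving operation instead and do the subsetting afterwards. 
The sketch below is an assumption-laden illustration, not the pvec 
approach itself: parSubsetByOverlaps is a made-up name, it uses 
overlapsAny (which returns one logical value per query range) in place 
of subsetByOverlaps, and it uses mclapply from the base "parallel" 
package over integer index chunks.

```r
library(parallel)
library(GenomicRanges)

# Hypothetical helper: compute a logical "has any overlap" flag for each
# query range in parallel chunks, then subset the query once at the end.
parSubsetByOverlaps <- function(query, subject, n.chunks = 4L, mc.cores = 2L) {
    n <- length(query)
    if (n == 0L)
        return(query)
    k <- min(n.chunks, n)
    # contiguous index chunks, in the original query order
    idx <- split(seq_len(n), ceiling(seq_len(n) * k / n))
    keep <- mclapply(idx, function(i) overlapsAny(query[i], subject),
                     mc.cores = mc.cores)
    # chunks are contiguous and in order, so the flags line up with query
    query[unlist(keep, use.names = FALSE)]
}
```

Because each chunk returns exactly one flag per input range, the output 
length always matches the input length and there is nothing to warn 
about.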

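Your for loop can be parallelized like this:

# For each state, the indices of the disjoint ranges that overlap it
overlapping.byState <- mclapply(byState, function(x)
    unique(queryHits(findOverlaps(disjoint, x))))
# Fill in the state labels, then convert the column to a factor
mcols(disjoint)[unlist(overlapping.byState, use.names = FALSE), "state"] <-
    rep(names(overlapping.byState),
        elementLengths(overlapping.byState))
mcols(disjoint)[, "state"] <- factor(mcols(disjoint)[, "state"])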

The above all assumes you are using the pvec and mclapply from this new 
BiocParallel package, which support operations on non-primitive 
vector-ish objects.

On 11/15/2012 11:02 AM, Tim Triche, Jr. wrote:
> As an aside, if I want to do the following:
>
>          ol <- findOverlaps(object, x)
>          so <- object[queryHits(ol)]
>          sx <- x[subjectHits(ol)]
>          disjoint <- subsetByOverlaps(disjoin(c(sx, so, ignore.mcols = T)),
>              so)
>          mcols(disjoint)[, "state"] <- rep("", length(disjoint))
>          byState <- split(so, mcols(so)[, "state"])
>          for (state in names(byState)) {
>              overlapping <- queryHits(findOverlaps(disjoint,
> byState[[state]]))
>              if (length(overlapping) > 0) mcols(disjoint[overlapping])[,
> "state"] <- state
>          }
>          mcols(disjoint)[, "state"] <- as.factor(mcols(disjoint)[, "state"])
>
> often and fast, where 'object' and 'x' are ranges with large numbers of
> intervals, is there a clever way to speed it up a lot?
>