[Bioc-devel] BiocParallel
Ryan C. Thompson
rct at thompsonclan.org
Fri Nov 16 00:06:27 CET 2012
You can probably parallelize the findOverlaps function, but you'd have
to write the code yourself, and that code would be mostly bookkeeping
code to get the indices right. Maybe there's a case for adding a
parallelized findOverlaps function to BiocParallel?
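
The bookkeeping mentioned above could look roughly like the sketch
below. It is only an illustration: parFindOverlaps is a hypothetical
helper (not a function in BiocParallel or GenomicRanges), it uses
mclapply from the base "parallel" package over plain integer index
chunks, and it returns a data.frame of hits rather than a proper Hits
object.

```r
library(GenomicRanges)
library(parallel)

# Hypothetical helper, not part of any package: chunked findOverlaps.
parFindOverlaps <- function(query, subject, n.chunks = 2L, mc.cores = 1L) {
  # Assign every query index to one of n.chunks contiguous blocks.
  chunk.ids <- ceiling(seq_along(query) * n.chunks / length(query))
  index.chunks <- split(seq_along(query), chunk.ids)
  hit.list <- mclapply(index.chunks, function(idx) {
    hits <- findOverlaps(query[idx], subject)
    # queryHits() is relative to the chunk, so map it back to
    # positions in the full query. Subject indices need no adjustment.
    data.frame(queryHits = idx[queryHits(hits)],
               subjectHits = subjectHits(hits))
  }, mc.cores = mc.cores)
  do.call(rbind, unname(hit.list))
}
```

Only the query is split here, so every worker still needs the whole
subject; whether that pays off depends on how large the subject is
compared to the cost of copying it to each worker.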
You can't parallelize the disjoin operation with something like
"mclapply", since it is not a data-parallel operation: the output for
one range depends on the other ranges it overlaps. Maybe you could
speed things up by writing a recursive version of disjoin which splits
its argument into subsets, runs disjoin on each one, and then calls
disjoin on the combined results. However, I'm not sure whether this
would actually be faster in practice. More simply, if your arguments
are GRanges, you can split by chromosome, run disjoin on each
chromosome in parallel, and then merge the results. Note that this may
leave the merged ranges in a different order, in case that matters to
you.
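
A minimal sketch of the split-by-chromosome approach, assuming the
input is a non-empty GRanges and using mclapply from the base
"parallel" package on an ordinary list. parDisjoinBySeqname is a
hypothetical name, and the final sort() is just one way to get the
merged ranges back into a defined order.

```r
library(GenomicRanges)
library(parallel)

# Hypothetical helper: disjoin each chromosome separately, then merge.
parDisjoinBySeqname <- function(gr, mc.cores = 1L) {
  # One GRanges per seqlevel, converted to a plain list for mclapply.
  by.chrom <- as.list(split(gr, as.factor(seqnames(gr))))
  pieces <- mclapply(by.chrom, disjoin, mc.cores = mc.cores)
  # Concatenate the per-chromosome results into a single GRanges.
  merged <- do.call(c, unname(pieces))
  # Sort so the result has a well-defined order.
  sort(merged)
}
```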
You can probably parallelize subsetByOverlaps using pvec, but you might
have to ignore the warning about the output length not being the same as
the input.
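
One way to sidestep the length warning is to run pvec over the query
indices and return a logical vector, which always has the same length
as its input, and then subset once at the end. This is a sketch of that
alternative rather than the approach described above; it uses pvec from
the base "parallel" package, and parSubsetByOverlaps is a hypothetical
name.

```r
library(GenomicRanges)
library(parallel)

# Hypothetical helper: keep the query ranges that overlap the subject.
parSubsetByOverlaps <- function(query, subject, mc.cores = 1L) {
  # pvec hands each worker a slice of the index vector; each worker
  # returns one TRUE/FALSE per index, so output length equals input length.
  keep <- pvec(seq_along(query), function(i) {
    countOverlaps(query[i], subject) > 0L
  }, mc.cores = mc.cores)
  query[keep]
}
```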
Your for loop can be parallelized like this:

overlapping.byState <- mclapply(byState, function(x)
    queryHits(findOverlaps(disjoint, x)))
# Assign plain character labels; the as.factor() call at the end of
# your code then converts the column.
mcols(disjoint)[unlist(overlapping.byState, use.names = FALSE), "state"] <-
    rep(names(overlapping.byState),
        vapply(overlapping.byState, length, integer(1)))
All of the above assumes you are using the pvec and mclapply from this
new BiocParallel package, which supports operations on non-primitive
vector-ish objects.
On 11/15/2012 11:02 AM, Tim Triche, Jr. wrote:
> As an aside, if I want to do the following:
>
> ol <- findOverlaps(object, x)
> so <- object[queryHits(ol)]
> sx <- x[subjectHits(ol)]
> disjoint <- subsetByOverlaps(disjoin(c(sx, so, ignore.mcols = T)),
> so)
> mcols(disjoint)[, "state"] <- rep("", length(disjoint))
> byState <- split(so, mcols(so)[, "state"])
> for (state in names(byState)) {
> overlapping <- queryHits(findOverlaps(disjoint,
> byState[[state]]))
> if (length(overlapping) > 0) mcols(disjoint[overlapping])[,
> "state"] <- state
> }
> mcols(disjoint)[, "state"] <- as.factor(mcols(disjoint)[, "state"])
>
> often and fast, where 'object' and 'x' are ranges with large numbers of
> intervals, is there a clever way to speed it up a lot?
>