[Bioc-devel] findOverlaps() news

Mon Dec 8 21:16:09 CET 2014

Nice job, Herve!

-steve

On Fri, Dec 5, 2014 at 1:17 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Developers,
>
> In BioC 3.1 the code behind overlap operations like findOverlaps(),
> countOverlaps(), subsetByOverlaps(), summarizeOverlaps(), nearest(),
> etc... has been refactored and improved. Some highlights on what has
> changed:
>
> - The underlying code used for finding/counting overlaps is now based
>   on the beautiful Nested Containment List algorithm by Alexander V.
>   Alekseyenko1 and Christopher J. Lee2 [1].
>
> - The old algorithm based on interval trees is still available. The
>   'algorithm' argument was added to most overlap-based operations to
>   let the user choose between the new (algorithm="nclist", the default)
>   and the old (algorithm="intervaltree") algorithm. Except for the 3
>   particular situations mentioned below, choosing one or the other
>   doesn't affect the output, only the performance.
>
> - Either the query or subject can be preprocessed with NCList() for
>   a Ranges object (replacement for IntervalTree()), and GNCList()
>   for a GenomicRanges object (replacement for GIntervalTree()).
>   For a one time use, it's not advised to explicitely preprocess the
>   input. This is because findOverlaps() or countOverlaps() will take
>   care of it and do a better job at it (that is, they preprocess only
>   what's needed when it's needed and release memory as they go).
>
> - GNCList() supports preprocessing of a GenomicRanges object with
>   ranges located on a circular sequence.
>
> - With the new algorithm, countOverlaps() on Ranges or GenomicRanges
>   objects doesn't call findOverlaps() to collect all the hits in a
>   growing Hits object and count them only at the end. Instead the
>   counting happens at the C level and the hits are not kept. This
>   reduces memory usage considerably when there is a lot of hits.
>
> - Zero-width ranges are now interpreted as insertion points and are
>   considered to overlap with ranges that contain them. This is the
>   1st situation where using 'algorithm="nclist"' or
>   'algorithm="intervaltree"' produces different output.
>
> - When using 'select="arbitrary"', the new algorithm will generally
>   not select the same hits as the old algorithm. This is the 2nd
>   situation where using 'algorithm="nclist"' or
>   'algorithm="intervaltree"' produces different output.
>
> - When using the old interval tree algorithm, 'maxgap' has a special
>   meaning if 'type' is "start", "end", or "within". This is not yet
>   the case with the new algorithm. That feature seems somewhat useful
>   though so maybe the new algorithm should also support it? Anyway,
>   this is the 3rd situation where using 'algorithm="nclist"' or
>   'algorithm="intervaltree"' produces different output.
>
> - Objects preprocessed with NCList() or GNCList() are serializable.
>
> Performance:
>
> In BioC 3.1 overlaps operations like findOverlaps() or
> summarizeOverlaps() on GRanges and/or GRangesList objects are
> typically between 3x and 10x faster than in BioC 3.0 for a medium
> size data set (e.g. 25 million reads). Memory usage is also reduced
> by 25% or more. These numbers may vary but the bigger the data set,
> the better they will be.
>
> This is with IRanges 2.1.29 and GenomicRanges 1.19.21 (will become
> available tomorrow).
>
> Some remaining tasks:
>
>   (a) Bring back special meaning of 'maxgap' when 'type' is "start",
>       "end", or "within".
>
>   (b) Update documentation.
>
>   (c) Enable user interrupts.
>
>   (d) GNCLists for prepreocessing a GRangesList object (just a
>       GNCList inside a CompressedList).
>
>   (e) Use threads (OpenMP) for even better performance (to be
>       implemented in an add-on package).
>
> Even though a lot of time and effort was spent in testing the new
> code and making sure it won't crash your session, you might run into
> problems. Please report.
>
> Enjoy and let me know if you have questions or comments about this.
>
> Thanks!
> H.
>
> [1] Alexander V. Alekseyenko1 and Christopher J. Lee2.
>     Nested Containment List (NCList): a new algorithm for accelerating
>     interval query of genome alignment and interval databases.
>     Bioinformatics (2007) 23 (11): 1386-1393.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Steve Lianoglou
Computational Biologist
Genentech