[Bioc-devel] findOverlaps() news

Hervé Pagès hpages at fredhutch.org
Fri Dec 5 22:17:56 CET 2014


Developers,

In BioC 3.1 the code behind overlap operations like findOverlaps(),
countOverlaps(), subsetByOverlaps(), summarizeOverlaps(), nearest(),
etc... has been refactored and improved. Some highlights on what has
changed:

- The underlying code used for finding/counting overlaps is now based
   on the beautiful Nested Containment List algorithm by Alexander V.
   Alekseyenko1 and Christopher J. Lee2 [1].

- The old algorithm based on interval trees is still available. The
   'algorithm' argument was added to most overlap-based operations to
   let the user choose between the new (algorithm="nclist", the default)
   and the old (algorithm="intervaltree") algorithm. Except for the 3
   particular situations mentioned below, choosing one or the other
   doesn't affect the output, only the performance.

- Either the query or subject can be preprocessed with NCList() for
   a Ranges object (replacement for IntervalTree()), and GNCList()
   for a GenomicRanges object (replacement for GIntervalTree()).
   For a one time use, it's not advised to explicitely preprocess the
   input. This is because findOverlaps() or countOverlaps() will take
   care of it and do a better job at it (that is, they preprocess only
   what's needed when it's needed and release memory as they go).

- GNCList() supports preprocessing of a GenomicRanges object with
   ranges located on a circular sequence.

- With the new algorithm, countOverlaps() on Ranges or GenomicRanges
   objects doesn't call findOverlaps() to collect all the hits in a
   growing Hits object and count them only at the end. Instead the
   counting happens at the C level and the hits are not kept. This
   reduces memory usage considerably when there is a lot of hits.

- Zero-width ranges are now interpreted as insertion points and are
   considered to overlap with ranges that contain them. This is the
   1st situation where using 'algorithm="nclist"' or
   'algorithm="intervaltree"' produces different output.

- When using 'select="arbitrary"', the new algorithm will generally
   not select the same hits as the old algorithm. This is the 2nd
   situation where using 'algorithm="nclist"' or
   'algorithm="intervaltree"' produces different output.

- When using the old interval tree algorithm, 'maxgap' has a special
   meaning if 'type' is "start", "end", or "within". This is not yet
   the case with the new algorithm. That feature seems somewhat useful
   though so maybe the new algorithm should also support it? Anyway,
   this is the 3rd situation where using 'algorithm="nclist"' or
   'algorithm="intervaltree"' produces different output.

- Objects preprocessed with NCList() or GNCList() are serializable.

Performance:

In BioC 3.1 overlaps operations like findOverlaps() or
summarizeOverlaps() on GRanges and/or GRangesList objects are
typically between 3x and 10x faster than in BioC 3.0 for a medium
size data set (e.g. 25 million reads). Memory usage is also reduced
by 25% or more. These numbers may vary but the bigger the data set,
the better they will be.

This is with IRanges 2.1.29 and GenomicRanges 1.19.21 (will become
available tomorrow).

Some remaining tasks:

   (a) Bring back special meaning of 'maxgap' when 'type' is "start",
       "end", or "within".

   (b) Update documentation.

   (c) Enable user interrupts.

   (d) GNCLists for prepreocessing a GRangesList object (just a
       GNCList inside a CompressedList).

   (e) Use threads (OpenMP) for even better performance (to be
       implemented in an add-on package).

Even though a lot of time and effort was spent in testing the new
code and making sure it won't crash your session, you might run into
problems. Please report.

Enjoy and let me know if you have questions or comments about this.

Thanks!
H.

[1] Alexander V. Alekseyenko1 and Christopher J. Lee2.
     Nested Containment List (NCList): a new algorithm for accelerating
     interval query of genome alignment and interval databases.
     Bioinformatics (2007) 23 (11): 1386-1393.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list