[Bioc-devel] findOverlaps() news
Hervé Pagès
hpages at fredhutch.org
Fri Dec 5 22:17:56 CET 2014
Developers,
In BioC 3.1 the code behind overlap operations like findOverlaps(),
countOverlaps(), subsetByOverlaps(), summarizeOverlaps(), nearest(),
etc... has been refactored and improved. Some highlights on what has
changed:
- The underlying code used for finding/counting overlaps is now based
on the beautiful Nested Containment List algorithm by Alexander V.
Alekseyenko1 and Christopher J. Lee2 [1].
- The old algorithm based on interval trees is still available. The
'algorithm' argument was added to most overlap-based operations to
let the user choose between the new (algorithm="nclist", the default)
and the old (algorithm="intervaltree") algorithm. Except for the 3
particular situations mentioned below, choosing one or the other
doesn't affect the output, only the performance.
- Either the query or subject can be preprocessed with NCList() for
a Ranges object (replacement for IntervalTree()), and GNCList()
for a GenomicRanges object (replacement for GIntervalTree()).
For a one time use, it's not advised to explicitely preprocess the
input. This is because findOverlaps() or countOverlaps() will take
care of it and do a better job at it (that is, they preprocess only
what's needed when it's needed and release memory as they go).
- GNCList() supports preprocessing of a GenomicRanges object with
ranges located on a circular sequence.
- With the new algorithm, countOverlaps() on Ranges or GenomicRanges
objects doesn't call findOverlaps() to collect all the hits in a
growing Hits object and count them only at the end. Instead the
counting happens at the C level and the hits are not kept. This
reduces memory usage considerably when there is a lot of hits.
- Zero-width ranges are now interpreted as insertion points and are
considered to overlap with ranges that contain them. This is the
1st situation where using 'algorithm="nclist"' or
'algorithm="intervaltree"' produces different output.
- When using 'select="arbitrary"', the new algorithm will generally
not select the same hits as the old algorithm. This is the 2nd
situation where using 'algorithm="nclist"' or
'algorithm="intervaltree"' produces different output.
- When using the old interval tree algorithm, 'maxgap' has a special
meaning if 'type' is "start", "end", or "within". This is not yet
the case with the new algorithm. That feature seems somewhat useful
though so maybe the new algorithm should also support it? Anyway,
this is the 3rd situation where using 'algorithm="nclist"' or
'algorithm="intervaltree"' produces different output.
- Objects preprocessed with NCList() or GNCList() are serializable.
Performance:
In BioC 3.1 overlaps operations like findOverlaps() or
summarizeOverlaps() on GRanges and/or GRangesList objects are
typically between 3x and 10x faster than in BioC 3.0 for a medium
size data set (e.g. 25 million reads). Memory usage is also reduced
by 25% or more. These numbers may vary but the bigger the data set,
the better they will be.
This is with IRanges 2.1.29 and GenomicRanges 1.19.21 (will become
available tomorrow).
Some remaining tasks:
(a) Bring back special meaning of 'maxgap' when 'type' is "start",
"end", or "within".
(b) Update documentation.
(c) Enable user interrupts.
(d) GNCLists for prepreocessing a GRangesList object (just a
GNCList inside a CompressedList).
(e) Use threads (OpenMP) for even better performance (to be
implemented in an add-on package).
Even though a lot of time and effort was spent in testing the new
code and making sure it won't crash your session, you might run into
problems. Please report.
Enjoy and let me know if you have questions or comments about this.
Thanks!
H.
[1] Alexander V. Alekseyenko1 and Christopher J. Lee2.
Nested Containment List (NCList): a new algorithm for accelerating
interval query of genome alignment and interval databases.
Bioinformatics (2007) 23 (11): 1386-1393.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list