[Bioc-devel] coverage as IntegerList

Hervé Pagès hpages at fhcrc.org
Wed Feb 12 03:58:38 CET 2014


Hi,

Why not? But I don't expect a significant speedup. Here is why:

There are actually 2 algos implemented by coverage(): one called "sort"
that computes the coverage directly into "Rle space", and one called
"hash" that computes the coverage into an ordinary integer vector and
turns this vector into an Rle at the end (this conversion is cheap).
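
To make the difference between the two approaches concrete, here is a
minimal Python sketch. It is not the actual C code behind coverage();
the function names, the (value, length) run representation, and the use
of 1-based inclusive ranges that fit inside the coverage vector are all
assumptions made for illustration. Both functions return the same runs,
but the "hash"-style one walks a dense vector while the "sort"-style one
only touches the range boundaries.

```python
from itertools import groupby


def append_run(runs, value, length):
    """Append a run, merging it with the previous one if the value repeats."""
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + length)
    else:
        runs.append((value, length))


def coverage_hash(ranges, width):
    """Dense approach: fill a plain integer vector, run-length encode at the end."""
    diff = [0] * (width + 1)
    for start, end in ranges:      # 1-based, inclusive, assumed within 1..width
        diff[start - 1] += 1       # depth goes up where a range begins
        diff[end] -= 1             # and down just after it ends
    cov, depth = [], 0
    for d in diff[:width]:
        depth += d
        cov.append(depth)
    # Final conversion of the dense vector into (value, length) runs.
    return [(value, sum(1 for _ in run)) for value, run in groupby(cov)]


def coverage_sort(ranges, width):
    """Sparse approach: sort the range boundaries and emit runs directly."""
    events = sorted([(s - 1, 1) for s, _ in ranges] + [(e, -1) for _, e in ranges])
    runs, depth, pos, i = [], 0, 0, 0
    while i < len(events):
        p = events[i][0]
        if p > pos:
            append_run(runs, depth, p - pos)
            pos = p
        # Apply every boundary that falls on this position before moving on.
        while i < len(events) and events[i][0] == p:
            depth += events[i][1]
            i += 1
    if pos < width:
        append_run(runs, depth, width - pos)
    return runs
```

The dense version does work proportional to the length of the coverage
vector, while the sorted version does work that depends on the number of
ranges, which is why one or the other wins depending on how dense the
data are.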

By default coverage() tries to pick the appropriate algo automatically:
"hash" when the data are dense, "sort" otherwise. The criterion
used to decide whether the data are dense is a little bit
naive (and could maybe be improved?): it just compares the number
of ranges in the input with the length of the coverage vector to
return. If the number of ranges > 0.25 * length-of-coverage-vector,
the data are considered dense. Clearly this formula is kind of arbitrary
and I'm sure it could be tweaked a little bit to do a better job.
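
The density test described above can be sketched in a few lines of
Python. The function name pick_method and its density_cutoff parameter
are hypothetical; only the 0.25 threshold and the strict ">" comparison
come from the description above.

```python
def pick_method(n_ranges, coverage_length, density_cutoff=0.25):
    """Return "hash" when the input looks dense, "sort" otherwise.

    Hypothetical helper mirroring the rule in the text:
    dense means n_ranges > density_cutoff * coverage_length.
    """
    if n_ranges > density_cutoff * coverage_length:
        return "hash"
    return "sort"
```

For example, pick_method(300, 1000) gives "hash" and
pick_method(250, 1000) gives "sort". Counting ranges is not an
unreasonable proxy: n ranges can produce at most 2n + 1 runs in the
result, so few ranges always means few runs. The rule can only err in
the other direction, for instance when many ranges pile up on the same
positions and the result still has few runs.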

Note that the user can choose the algo to use via the 'method' arg.
If you know your data are dense, use method="hash". It will be almost
as fast as if coverage() were returning an IntegerList, except that
the coverage is turned into an Rle (but only at the end). I would
expect this final coercion to be negligible compared to the computation
of the coverage itself. This would need to be confirmed by some
profiling though.

Anyway maybe there are other benefits of returning an IntegerList:
smaller memory footprint when the data are dense,
more beginner-friendly container, maybe slightly faster
downstream computations (can this be a bottleneck?), others?
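
As a rough back-of-the-envelope check on the memory point, here is a
small Python sketch. It assumes 4-byte integers, an Rle that stores one
integer value plus one integer length per run, and it ignores all object
overhead; the function names are made up for illustration.

```python
INT_BYTES = 4  # assumption: 32-bit integers, no per-object overhead counted


def dense_bytes(length):
    """Approximate size of a plain integer coverage vector."""
    return INT_BYTES * length


def rle_bytes(n_runs):
    """Approximate size of a run-length encoding: one value and one length per run."""
    return 2 * INT_BYTES * n_runs


def rle_is_smaller(n_runs, length):
    """True when the run-length form takes less space than the dense vector."""
    return rle_bytes(n_runs) < dense_bytes(length)
```

Under these assumptions the run-length form only pays off when there are
fewer than half as many runs as positions, so on dense data, where the
coverage changes at nearly every position, the plain vector is the
smaller of the two.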

H.


On 02/11/2014 05:06 PM, Michael Lawrence wrote:
> Right, it would be a choice. The compression is not worth it when the data
> are dense.
>
>
> On Tue, Feb 11, 2014 at 4:18 PM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
>
>> Sounds reasonable, _especially_ if you think it is faster.  You're the
>> expert.  I assume you will allow the user to choose the return value?
>>   Having the option of Rle's is still nice, for some use cases.
>>
>>
>> On Tue, Feb 11, 2014 at 7:12 PM, Michael Lawrence <
>> lawrence.michael at gene.com> wrote:
>>
>>> Just a thought: support coverage calculation directly to IntegerList. Will
>>> very often be faster than RleList, especially when limiting to regions
>>> without long runs of zeros, and with WGS data.
>>>
>>> Something to put on the TODO list?
>>>
>>> Michael
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
