[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Thu Oct 17 22:07:59 CEST 2013

Hi Stephanie,

On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
> What is the best way to compute the coverage on a ReadGAlignments object
> when I have many, many (about 50 000) seqlevels?
>
> Just computing coverage(object) is terribly slow due to the huge amount of seqlevels.
>
> At the moment, I am subsetting the ReadGAlignments object for each seqlevel,
>   I omit the redundant seqlevels and calculate the coverage.
> This is fast if you do it for a few seqlevels.
> But if one really would like to compute the coverage for each seqlevel individually,
> this really takes a long time..

Yes this is a known weakness in the current implementation of
coverage(). It loops over the seqlevels and this loop is currently
performed in R, which is slow. Unfortunately this is not the first
time someone complains about this.

Time for us to bite the bullet I guess. I've just started to work
on moving the loop to the C-level. That should speed up things
significantly. I'll let you know when this is ready.

Thanks for the feedback.
H.

>
>
> I would appreciate any comments!
>
> Best,
> Stefanie
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319