[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels
Stefanie Tauber
stefanie.tauber at univie.ac.at
Mon Oct 28 13:11:51 CET 2013
Hi Hervé,
I just wanted to try the new coverage function:
# I read in a bam file, human data mapped against the transcriptome
x = readGAlignments(human_trans_bam)
# This is my GAligments object
> x
GAlignments with 46765032 alignments and 0 metadata columns:
seqnames strand cigar qwidth
<Rle> <Rle> <character> <integer>
[1] hg19_ensGene_ENST00000237247 + 75M 75
[2] hg19_ensGene_ENST00000371036 - 75M 75
[3] hg19_ensGene_ENST00000371037 - 75M 75
[4] hg19_ensGene_ENST00000471889 - 75M 75
[5] hg19_ensGene_ENST00000377479 + 75M 75
... ... ... ... ...
[46765028] hg19_ensGene_ENST00000344867 - 75M 75
[46765029] hg19_ensGene_ENST00000400848 - 75M 75
[46765030] hg19_ensGene_ENST00000400848 - 75M 75
[46765031] hg19_ensGene_ENST00000400848 - 75M 75
[46765032] hg19_ensGene_ENST00000400829 - 75M 75
start end width ngap
<integer> <integer> <integer> <integer>
[1] 1783 1857 75 0
[2] 690 764 75 0
[3] 1632 1706 75 0
[4] 2121 2195 75 0
[5] 391 465 75 0
... ... ... ... ...
[46765028] 280 354 75 0
[46765029] 509 583 75 0
[46765030] 509 583 75 0
[46765031] 509 583 75 0
[46765032] 179 253 75 0
---
seqlengths:
hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
2580 ... 999
When I want to compute the coverage, I get the following:
> cov_x = coverage(x)
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object '.Ranges.coverage' not found
Do you know what I am doing wrong?
Thanks,
Stefanie
On Fr, 25.10.2013, 11:44, Hervé Pagès wrote:
> Hi Stephanie,
>
> Just to let you know that in GenomicRanges 1.14.2, coverage() should
> now be much faster on objects with many seqlevels. Should be about
> 1000x faster or more if you have 50 000 seqlevels.
>
> The only case where it's still slow is if many seqlevels are marked as
> circular (and have a non-NA seqlength) in seqinfo(x), because for
> these seqlevels the coverage vector is still post processed by a
> loop in R. Could be a problem if for example people have BAM files
> with reads aligned to thousands of bacterial genomes.
>
> GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
> biocLite() in the next 24 hours or so. The man pages for the "coverage"
> methods have been improved, especially in IRanges.
>
> Cheers,
> H.
>
>
> On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>> Thanks for the answer!
>> Just wanted to make sure that I don't miss a relevant feature….
>>
>> Best,
>> Stefanie
>>
>> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>>:
>>
>>> Hi Stephanie,
>>>
>>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>>>> What is the best way to compute the coverage on a ReadGAlignments
>>>> object
>>>> when I have many, many (about 50 000) seqlevels?
>>>>
>>>> Just computing coverage(object) is terribly slow due to the huge
>>>> amount of seqlevels.
>>>>
>>>> At the moment, I am subsetting the ReadGAlignments object for each
>>>> seqlevel,
>>>> I omit the redundant seqlevels and calculate the coverage.
>>>> This is fast if you do it for a few seqlevels.
>>>> But if one really would like to compute the coverage for each
>>>> seqlevel individually,
>>>> this really takes a long time..
>>>
>>> Yes this is a known weakness in the current implementation of
>>> coverage(). It loops over the seqlevels and this loop is currently
>>> performed in R, which is slow. Unfortunately this is not the first
>>> time someone complains about this.
>>>
>>> Time for us to bite the bullet I guess. I've just started to work
>>> on moving the loop to the C-level. That should speed up things
>>> significantly. I'll let you know when this is ready.
>>>
>>> Thanks for the feedback.
>>> H.
>>>
>>>>
>>>>
>>>> I would appreciate any comments!
>>>>
>>>> Best,
>>>> Stefanie
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
More information about the Bioconductor
mailing list