[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Mon Oct 28 15:00:53 CET 2013

Here is my SessionInfo,
GenomicRanges as well as IRanges should be up-to-date.

R Under development (unstable) (2013-10-15 r64056)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rsamtools_1.15.2       Biostrings_2.31.0      GenomicFeatures_1.15.2
[4] AnnotationDbi_1.25.2   Biobase_2.23.1         GenomicRanges_1.15.1
[7] XVector_0.3.0          IRanges_1.21.5         BiocGenerics_0.9.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.19.0     bitops_1.0-6       BSgenome_1.31.0    DBI_0.2-7
 [5] RCurl_1.95-4.1     RSQLite_0.11.4     rtracklayer_1.23.1 stats4_3.1.0
 [9] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.9.0

Best,
Stefanie

On Mo, 28.10.2013, 14:56, Michael Lawrence wrote:
> Probably need to update GenomicRanges.
>
>
> On Mon, Oct 28, 2013 at 5:11 AM, Stefanie Tauber <
> stefanie.tauber at univie.ac.at> wrote:
>
>> Hi Hervé,
>>
>> I just wanted to try the new coverage function:
>>
>> # I read in a bam file, human data mapped against the transcriptome
>> x = readGAlignments(human_trans_bam)
>>
>> # This is my GAligments object
>> > x
>> GAlignments with 46765032 alignments and 0 metadata columns:
>>                                  seqnames strand       cigar    qwidth
>>                                     <Rle>  <Rle> <character> <integer>
>>          [1] hg19_ensGene_ENST00000237247      +         75M        75
>>          [2] hg19_ensGene_ENST00000371036      -         75M        75
>>          [3] hg19_ensGene_ENST00000371037      -         75M        75
>>          [4] hg19_ensGene_ENST00000471889      -         75M        75
>>          [5] hg19_ensGene_ENST00000377479      +         75M        75
>>          ...                          ...    ...         ...       ...
>>   [46765028] hg19_ensGene_ENST00000344867      -         75M        75
>>   [46765029] hg19_ensGene_ENST00000400848      -         75M        75
>>   [46765030] hg19_ensGene_ENST00000400848      -         75M        75
>>   [46765031] hg19_ensGene_ENST00000400848      -         75M        75
>>   [46765032] hg19_ensGene_ENST00000400829      -         75M        75
>>                  start       end     width      ngap
>>              <integer> <integer> <integer> <integer>
>>          [1]      1783      1857        75         0
>>          [2]       690       764        75         0
>>          [3]      1632      1706        75         0
>>          [4]      2121      2195        75         0
>>          [5]       391       465        75         0
>>          ...       ...       ...       ...       ...
>>   [46765028]       280       354        75         0
>>   [46765029]       509       583        75         0
>>   [46765030]       509       583        75         0
>>   [46765031]       509       583        75         0
>>   [46765032]       179       253        75         0
>>   ---
>>   seqlengths:
>>    hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>>                            2580 ...                          999
>>
>> When I want to compute the coverage, I get the following:
>> > cov_x = coverage(x)
>>
>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>>   object '.Ranges.coverage' not found
>>
>> Do you know what I am doing wrong?
>>
>> Thanks,
>> Stefanie
>>
>>
>>
>> On Fr, 25.10.2013, 11:44, Hervé Pagès wrote:
>> > Hi Stephanie,
>> >
>> > Just to let you know that in GenomicRanges 1.14.2, coverage() should
>> > now be much faster on objects with many seqlevels. Should be about
>> > 1000x faster or more if you have 50 000 seqlevels.
>> >
>> > The only case where it's still slow is if many seqlevels are marked as
>> > circular (and have a non-NA seqlength) in seqinfo(x), because for
>> > these seqlevels the coverage vector is still post processed by a
>> > loop in R. Could be a problem if for example people have BAM files
>> > with reads aligned to thousands of bacterial genomes.
>> >
>> > GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
>> > biocLite() in the next 24 hours or so. The man pages for the
>> "coverage"
>> > methods have been improved, especially in IRanges.
>> >
>> > Cheers,
>> > H.
>> >
>> >
>> > On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>> >> Thanks for the answer!
>> >> Just wanted to make sure that I don't miss a relevant feature….
>> >>
>> >> Best,
>> >> Stefanie
>> >>
>> >> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>> >> <mailto:hpages at fhcrc.org>>:
>> >>
>> >>> Hi Stephanie,
>> >>>
>> >>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>> >>>> What is the best way to compute the coverage on a ReadGAlignments
>> >>>> object
>> >>>> when I have many, many (about 50 000) seqlevels?
>> >>>>
>> >>>> Just computing coverage(object) is terribly slow due to the huge
>> >>>> amount of seqlevels.
>> >>>>
>> >>>> At the moment, I am subsetting the ReadGAlignments object for each
>> >>>> seqlevel,
>> >>>>  I omit the redundant seqlevels and calculate the coverage.
>> >>>> This is fast if you do it for a few seqlevels.
>> >>>> But if one really would like to compute the coverage for each
>> >>>> seqlevel individually,
>> >>>> this really takes a long time..
>> >>>
>> >>> Yes this is a known weakness in the current implementation of
>> >>> coverage(). It loops over the seqlevels and this loop is currently
>> >>> performed in R, which is slow. Unfortunately this is not the first
>> >>> time someone complains about this.
>> >>>
>> >>> Time for us to bite the bullet I guess. I've just started to work
>> >>> on moving the loop to the C-level. That should speed up things
>> >>> significantly. I'll let you know when this is ready.
>> >>>
>> >>> Thanks for the feedback.
>> >>> H.
>> >>>
>> >>>>
>> >>>>
>> >>>> I would appreciate any comments!
>> >>>>
>> >>>> Best,
>> >>>> Stefanie
>> >>>>
>> >>>> _______________________________________________
>> >>>> Bioconductor mailing list
>> >>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>>> Search the archives:
>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>>
>> >>>
>> >>> --
>> >>> Hervé Pagès
>> >>>
>> >>> Program in Computational Biology
>> >>> Division of Public Health Sciences
>> >>> Fred Hutchinson Cancer Research Center
>> >>> 1100 Fairview Ave. N, M1-B514
>> >>> P.O. Box 19024
>> >>> Seattle, WA 98109-1024
>> >>>
>> >>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> >>> Phone:  (206) 667-5791
>> >>> Fax:    (206) 667-1319
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
_____________________________________________________________

Stefanie Tauber, PhD

Center for Integrative Bioinformatics Vienna (CIBIV)
Max F. Perutz Laboratories (MFPL)
Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2
Dr. Bohr Gasse 9
A-1030 Wien, Austria
Phone: ++43 +1 / 42772-4030
Fax:   ++43 +1 / 42772-4098
email: stefanie.tauber at univie.ac.at