[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels
Stefanie Tauber
stefanie.tauber at univie.ac.at
Mon Oct 28 15:00:53 CET 2013
Here is my SessionInfo,
GenomicRanges as well as IRanges should be up-to-date.
R Under development (unstable) (2013-10-15 r64056)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Rsamtools_1.15.2 Biostrings_2.31.0 GenomicFeatures_1.15.2
[4] AnnotationDbi_1.25.2 Biobase_2.23.1 GenomicRanges_1.15.1
[7] XVector_0.3.0 IRanges_1.21.5 BiocGenerics_0.9.0
loaded via a namespace (and not attached):
[1] biomaRt_2.19.0 bitops_1.0-6 BSgenome_1.31.0 DBI_0.2-7
[5] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.23.1 stats4_3.1.0
[9] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.9.0
Best,
Stefanie
On Mo, 28.10.2013, 14:56, Michael Lawrence wrote:
> Probably need to update GenomicRanges.
>
>
> On Mon, Oct 28, 2013 at 5:11 AM, Stefanie Tauber <
> stefanie.tauber at univie.ac.at> wrote:
>
>> Hi Hervé,
>>
>> I just wanted to try the new coverage function:
>>
>> # I read in a bam file, human data mapped against the transcriptome
>> x = readGAlignments(human_trans_bam)
>>
>> # This is my GAligments object
>> > x
>> GAlignments with 46765032 alignments and 0 metadata columns:
>> seqnames strand cigar qwidth
>> <Rle> <Rle> <character> <integer>
>> [1] hg19_ensGene_ENST00000237247 + 75M 75
>> [2] hg19_ensGene_ENST00000371036 - 75M 75
>> [3] hg19_ensGene_ENST00000371037 - 75M 75
>> [4] hg19_ensGene_ENST00000471889 - 75M 75
>> [5] hg19_ensGene_ENST00000377479 + 75M 75
>> ... ... ... ... ...
>> [46765028] hg19_ensGene_ENST00000344867 - 75M 75
>> [46765029] hg19_ensGene_ENST00000400848 - 75M 75
>> [46765030] hg19_ensGene_ENST00000400848 - 75M 75
>> [46765031] hg19_ensGene_ENST00000400848 - 75M 75
>> [46765032] hg19_ensGene_ENST00000400829 - 75M 75
>> start end width ngap
>> <integer> <integer> <integer> <integer>
>> [1] 1783 1857 75 0
>> [2] 690 764 75 0
>> [3] 1632 1706 75 0
>> [4] 2121 2195 75 0
>> [5] 391 465 75 0
>> ... ... ... ... ...
>> [46765028] 280 354 75 0
>> [46765029] 509 583 75 0
>> [46765030] 509 583 75 0
>> [46765031] 509 583 75 0
>> [46765032] 179 253 75 0
>> ---
>> seqlengths:
>> hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>> 2580 ... 999
>>
>> When I want to compute the coverage, I get the following:
>> > cov_x = coverage(x)
>>
>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>> object '.Ranges.coverage' not found
>>
>> Do you know what I am doing wrong?
>>
>> Thanks,
>> Stefanie
>>
>>
>>
>> On Fr, 25.10.2013, 11:44, Hervé Pagès wrote:
>> > Hi Stephanie,
>> >
>> > Just to let you know that in GenomicRanges 1.14.2, coverage() should
>> > now be much faster on objects with many seqlevels. Should be about
>> > 1000x faster or more if you have 50 000 seqlevels.
>> >
>> > The only case where it's still slow is if many seqlevels are marked as
>> > circular (and have a non-NA seqlength) in seqinfo(x), because for
>> > these seqlevels the coverage vector is still post processed by a
>> > loop in R. Could be a problem if for example people have BAM files
>> > with reads aligned to thousands of bacterial genomes.
>> >
>> > GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
>> > biocLite() in the next 24 hours or so. The man pages for the
>> "coverage"
>> > methods have been improved, especially in IRanges.
>> >
>> > Cheers,
>> > H.
>> >
>> >
>> > On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>> >> Thanks for the answer!
>> >> Just wanted to make sure that I don't miss a relevant feature….
>> >>
>> >> Best,
>> >> Stefanie
>> >>
>> >> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>> >> <mailto:hpages at fhcrc.org>>:
>> >>
>> >>> Hi Stephanie,
>> >>>
>> >>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>> >>>> What is the best way to compute the coverage on a ReadGAlignments
>> >>>> object
>> >>>> when I have many, many (about 50 000) seqlevels?
>> >>>>
>> >>>> Just computing coverage(object) is terribly slow due to the huge
>> >>>> amount of seqlevels.
>> >>>>
>> >>>> At the moment, I am subsetting the ReadGAlignments object for each
>> >>>> seqlevel,
>> >>>> I omit the redundant seqlevels and calculate the coverage.
>> >>>> This is fast if you do it for a few seqlevels.
>> >>>> But if one really would like to compute the coverage for each
>> >>>> seqlevel individually,
>> >>>> this really takes a long time..
>> >>>
>> >>> Yes this is a known weakness in the current implementation of
>> >>> coverage(). It loops over the seqlevels and this loop is currently
>> >>> performed in R, which is slow. Unfortunately this is not the first
>> >>> time someone complains about this.
>> >>>
>> >>> Time for us to bite the bullet I guess. I've just started to work
>> >>> on moving the loop to the C-level. That should speed up things
>> >>> significantly. I'll let you know when this is ready.
>> >>>
>> >>> Thanks for the feedback.
>> >>> H.
>> >>>
>> >>>>
>> >>>>
>> >>>> I would appreciate any comments!
>> >>>>
>> >>>> Best,
>> >>>> Stefanie
>> >>>>
>> >>>> _______________________________________________
>> >>>> Bioconductor mailing list
>> >>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>>> Search the archives:
>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>>
>> >>>
>> >>> --
>> >>> Hervé Pagès
>> >>>
>> >>> Program in Computational Biology
>> >>> Division of Public Health Sciences
>> >>> Fred Hutchinson Cancer Research Center
>> >>> 1100 Fairview Ave. N, M1-B514
>> >>> P.O. Box 19024
>> >>> Seattle, WA 98109-1024
>> >>>
>> >>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> >>> Phone: (206) 667-5791
>> >>> Fax: (206) 667-1319
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
--
_____________________________________________________________
Stefanie Tauber, PhD
Center for Integrative Bioinformatics Vienna (CIBIV)
Max F. Perutz Laboratories (MFPL)
Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2
Dr. Bohr Gasse 9
A-1030 Wien, Austria
Phone: ++43 +1 / 42772-4030
Fax: ++43 +1 / 42772-4098
email: stefanie.tauber at univie.ac.at
More information about the Bioconductor
mailing list