[BioC] coverage on very large ReadGappedAlignmentsObject
Hervé Pagès
hpages at fhcrc.org
Wed May 8 23:34:47 CEST 2013
Hi,
On 05/08/2013 08:50 AM, Kasper Daniel Hansen wrote:
> I have observed big slowdowns with GRanges when the number of seqlevels is
> large; not sure if this is related. My observation was mostly about
> findOverlaps.
Right. It's the same issue here: the "coverage" method for
GenomicRanges objects (which is what is called behind the scene when
calling coverage() on a GappedAlignments object) is looping over the
seqlevels (with an lapply). The overhead of that loop is big:
## Only 1 seqlevel:
gr <- GRanges("chr1", IRanges(35:5, width=10))
seqlengths(gr) <- 100
> system.time(cvg1 <- coverage(gr))
user system elapsed
0.016 0.000 0.013
## Add 50000 artificial seqlevels:
fakechroms <- Seqinfo(paste0("fakechr", 1:50000),
seqlengths=rep(200, 50000))
seqinfo(gr) <- merge(seqinfo(gr), fakechroms)
> system.time(cvg2 <- coverage(gr))
user system elapsed
134.744 0.376 135.417
Moving the loop at the C level would probably help. I will look into
that and report back here when it's ready.
Cheers,
H.
>
>
> On Wed, May 8, 2013 at 11:27 AM, Stefanie Tauber <
> stefanie.tauber at univie.ac.at> wrote:
>
>> To be more specific,
>> it takes about 10 minutes to calculate the coverage of 1 x 10^6 reads on
>> my pc (linux, 4 cpus, 8GB RAM)
>>
>>
>> This seems a bit long to me, what do you think?
>>
>>
>>
>> Am 08.05.2013 um 17:09 schrieb Steve Lianoglou <lianoglou.steve at gene.com>:
>>
>>> Another approach to what Martin suggests would be to simply read in the
>> reads and process them (do the coverage calculation per transcript stuff)
>> one chromosome at a time.
>>>
>>> On Wednesday, May 8, 2013, Martin Morgan wrote:
>>> On 05/08/2013 04:40 AM, Stefanie Tauber wrote:
>>> Dear list,
>>>
>>> I have some human data, mapped with bowtie to the transcriptome and then
>>> further processed with eXpress.
>>> The output of eXpress is a sorted bam file, containing the alignments of
>>> the reads to different transcripts.
>>>
>>> So, the bam file contains about 48 x 10^6 alignments, distributed over
>>> about 50000 transcripts.
>>> I read in the bam file with 'readGappedAlignments'.
>>>
>>>
>>> reads
>>> GappedAlignments with 46765032 alignments and 0 metadata columns:
>>> seqnames strand cigar qwidth
>>> <Rle> <Rle> <character> <integer>
>>> [1] hg19_ensGene_ENST00000237247 + 75M 75
>>> [2] hg19_ensGene_ENST00000371036 - 75M 75
>>> [3] hg19_ensGene_ENST00000371037 - 75M 75
>>> [4] hg19_ensGene_ENST00000471889 - 75M 75
>>> [5] hg19_ensGene_ENST00000377479 + 75M 75
>>> ... ... ... ... ...
>>> [46765028] hg19_ensGene_ENST00000344867 - 75M 75
>>> [46765029] hg19_ensGene_ENST00000400848 - 75M 75
>>> [46765030] hg19_ensGene_ENST00000400848 - 75M 75
>>> [46765031] hg19_ensGene_ENST00000400848 - 75M 75
>>> [46765032] hg19_ensGene_ENST00000400829 - 75M 75
>>> start end width ngap
>>> <integer> <integer> <integer> <integer>
>>> [1] 1783 1857 75 0
>>> [2] 690 764 75 0
>>> [3] 1632 1706 75 0
>>> [4] 2121 2195 75 0
>>> [5] 391 465 75 0
>>> ... ... ... ... ...
>>> [46765028] 280 354 75 0
>>> [46765029] 509 583 75 0
>>> [46765030] 509 583 75 0
>>> [46765031] 509 583 75 0
>>> [46765032] 179 253 75 0
>>> ---
>>> seqlengths:
>>> hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>>> 2580 ... 999
>>>
>>>
>>>
>>>
>>> Now, for whatever reason, I would like to have the coverage vector for
>>> each transcript.
>>> So, basically I would just compute
>>> coverage(reads).
>>>
>>> This works very well for smaller genomes and experiments (e.g. yeast with
>>> 10 x 10^6 alignments).
>>>
>>> But, using 'coverage' on this large ReadGappedAlignments Object takes
>>> really very very long.
>>>
>>> Hi Stefanie --
>>>
>>> One reason might be that you are on the edge of physical memory
>> availability and you are swapping to disk. A solution is to read in chunks,
>> which turns out to be very easy
>>>
>>> bam = BamFile(<path/to/bam>, yieldSize = 10000000)
>>> open(bam)
>>> cvg = NULL
>>> while (length(reads <- readGappedAlignments(bam, <...>))) {
>>> if (cvg == NULL)
>>> cvg = coverage(reads)
>>> else cvg = cvg + coverage(reads)
>>> }
>>> close(bam)
>>>
>>> This will be relatively efficient; yieldSize can be adjusted, including
>> to a size so that several bam files can be processed in parallel. Start
>> with a small yieldSize to quickly debug the iteration, before processing
>> the whole file. It can be further simplified using coverage,BamFile-method
>> directly, if you're not wanting to do anything else with the gapped
>> alignments.
>>>
>>> Martin
>>>
>>>
>>> Is there a way to do this better (in R)? Or is this just not feasible to
>>> do in R and I should rather preprocess the data outside R?
>>>
>>> I would appreciate any comments,
>>> best,
>>> Stefanie
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Department of Bioinformatics and Computational Biology
>>> Genentech
>>>
>>
>> DI Stefanie Tauber
>>
>> Center for Integrative Bioinformatics Vienna (CIBIV)
>> (CIBIV is a joint institute of Vienna University and Medical University)
>> Max F. Perutz Laboratories (MFPL)
>> Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2
>> Dr. Bohr Gasse 9
>> A-1030 Wien, Austria
>> Phone: ++43 +1 / 42772-4030
>> Fax: ++43 +1 / 42772-4098
>> email: stefanie.tauber at univie.ac.at
>> www.cibiv.at
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list