[BioC] coverage on very large ReadGappedAlignmentsObject
Martin Morgan
mtmorgan at fhcrc.org
Wed May 8 14:34:32 CEST 2013
On 05/08/2013 04:40 AM, Stefanie Tauber wrote:
> Dear list,
>
> I have some human data, mapped with bowtie to the transcriptome and then
> further processed with eXpress.
> The output of eXpress is a sorted bam file, containing the alignments of
> the reads to different transcripts.
>
> So, the bam file contains about 48 x 10^6 alignments, distributed over
> about 50000 transcripts.
> I read in the bam file with 'readGappedAlignments'.
>
>
>> reads
> GappedAlignments with 46765032 alignments and 0 metadata columns:
>                                 seqnames strand       cigar    qwidth
>                                    <Rle>  <Rle> <character> <integer>
>        [1] hg19_ensGene_ENST00000237247      +         75M        75
>        [2] hg19_ensGene_ENST00000371036      -         75M        75
>        [3] hg19_ensGene_ENST00000371037      -         75M        75
>        [4] hg19_ensGene_ENST00000471889      -         75M        75
>        [5] hg19_ensGene_ENST00000377479      +         75M        75
>        ...                          ...    ...         ...       ...
> [46765028] hg19_ensGene_ENST00000344867      -         75M        75
> [46765029] hg19_ensGene_ENST00000400848      -         75M        75
> [46765030] hg19_ensGene_ENST00000400848      -         75M        75
> [46765031] hg19_ensGene_ENST00000400848      -         75M        75
> [46765032] hg19_ensGene_ENST00000400829      -         75M        75
>                start       end     width      ngap
>            <integer> <integer> <integer> <integer>
>        [1]      1783      1857        75         0
>        [2]       690       764        75         0
>        [3]      1632      1706        75         0
>        [4]      2121      2195        75         0
>        [5]       391       465        75         0
>        ...       ...       ...       ...       ...
> [46765028]       280       354        75         0
> [46765029]       509       583        75         0
> [46765030]       509       583        75         0
> [46765031]       509       583        75         0
> [46765032]       179       253        75         0
> ---
> seqlengths:
>  hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>                          2580 ...                          999
>
>
>
>
> Now, for whatever reason, I would like to have the coverage vector for
> each transcript.
> So, basically I would just compute
> coverage(reads).
>
> This works very well for smaller genomes and experiments (e.g. yeast with
> 10 x 10^6 alignments).
>
> But using 'coverage' on this large GappedAlignments object takes a
> very long time.
Hi Stefanie --
One reason might be that you are at the edge of physical memory and are
swapping to disk. A solution is to read the file in chunks, which turns out
to be very easy:
library(GenomicRanges)    ## readGappedAlignments(), coverage()
library(Rsamtools)        ## BamFile()

bam = BamFile(<path/to/bam>, yieldSize = 10000000)
open(bam)                 ## keep the file open across iterations
cvg = NULL
while (length(reads <- readGappedAlignments(bam, <...>))) {
    ## accumulate coverage one chunk of alignments at a time
    if (is.null(cvg))
        cvg = coverage(reads)
    else
        cvg = cvg + coverage(reads)
}
close(bam)
This will be relatively efficient. yieldSize can be adjusted, including to a
size small enough that several bam files can be processed in parallel. Start
with a small yieldSize to debug the iteration quickly, before processing the
whole file. If you don't need to do anything else with the gapped alignments,
this can be simplified further by using the coverage,BamFile-method directly;
see the sketch below.
Martin
> Is there a better way to do this (in R)? Or is it just not feasible in R,
> and should I rather preprocess the data outside R?
>
> I would appreciate any comments,
> best,
> Stefanie
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793