[BioC] coverage on very large ReadGappedAlignmentsObject

Wed May 8 13:40:59 CEST 2013

Dear list,

I have some human data, mapped with bowtie to the transcriptome and then
further processed with eXpress.
The output of eXpress is a sorted bam file, containing the alignments of
the reads to different transcripts.

So, the bam file contains about 48 x 10^6 alignments, distributed over
about 50000 transcripts.
I read in the bam file with 'readGappedAlignments'.

> reads
GappedAlignments with 46765032 alignments and 0 metadata columns:
                                 seqnames strand       cigar    qwidth
                                    <Rle>  <Rle> <character> <integer>
         [1] hg19_ensGene_ENST00000237247      +         75M        75
         [2] hg19_ensGene_ENST00000371036      -         75M        75
         [3] hg19_ensGene_ENST00000371037      -         75M        75
         [4] hg19_ensGene_ENST00000471889      -         75M        75
         [5] hg19_ensGene_ENST00000377479      +         75M        75
         ...                          ...    ...         ...       ...
  [46765028] hg19_ensGene_ENST00000344867      -         75M        75
  [46765029] hg19_ensGene_ENST00000400848      -         75M        75
  [46765030] hg19_ensGene_ENST00000400848      -         75M        75
  [46765031] hg19_ensGene_ENST00000400848      -         75M        75
  [46765032] hg19_ensGene_ENST00000400829      -         75M        75
                 start       end     width      ngap
             <integer> <integer> <integer> <integer>
         [1]      1783      1857        75         0
         [2]       690       764        75         0
         [3]      1632      1706        75         0
         [4]      2121      2195        75         0
         [5]       391       465        75         0
         ...       ...       ...       ...       ...
  [46765028]       280       354        75         0
  [46765029]       509       583        75         0
  [46765030]       509       583        75         0
  [46765031]       509       583        75         0
  [46765032]       179       253        75         0
  ---
  seqlengths:
   hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
                           2580 ...                          999

Now, for whatever reason, I would like to have the coverage vector for
each transcript.
So, basically I would just compute
coverage(reads).

This works very well for smaller genomes and experiments (e.g. yeast with
10 x 10^6 alignments).

But, using 'coverage' on this large ReadGappedAlignments Object takes
really very very long.
Is there a way to do this better (in R)? Or is this just not feasible to
do in R and I should rather preprocess the data outside R?

I would appreciate any comments,
best,
Stefanie