[Bioc-devel] Running Time of readBamGappedAlignmentPairs

Hervé Pagès hpages at fhcrc.org
Thu Aug 1 01:05:07 CEST 2013


Hi Dario,

Thanks for providing the file.

The file has 55780092 records in it and yes it only contains primary
alignments so this is the best situation for the pairing algorithm.

Here are the timings I get for the readBamGappedAlignmentPairs()
function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
Ubuntu server with 384 Gb of RAM:

## With BioC release (Rsamtools 1.12.3)
## ------------------------------------
## Timings:
##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
##     - scanBam:                6 min      (A1)
##     - other:                  4 min 30 s (A2)
##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
##     - findMateAlignment:      1 min 30 s (B1)
##     - other:                  4 min      (B2)
##   Total time:                16 min      (A+B)
## Memory usage: 13 Gb

## With BioC devel (Rsamtools 1.13.19)
## -----------------------------------
## Timings:
##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
##     - scanBam:               6 min      (A1)
##     - other:                 5 min      (A2)
##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
##     - findMateAlignment:     1 min      (B1)
##     - other:                 3 min      (B2)
##   Total time:               15 min      (A+B)
## Memory usage: 13 Gb

Indeed, not much difference between release and devel :-/

I didn't have the opportunity to do this kind of detailed timing of
the readGAlignmentPairsFromBam() function on such a big file before,
but I find it pretty interesting (I only benchmarked findMateAlignment()
and it was on simulated data).

In particular I'm quite surprised to discover that the pairing algo
(implemented in findMateAlignment) is actually not the bottleneck
(and it being optimized in BioC devel explains the small difference
between total time in release and total time in devel).

Just a note on the timings you sent earlier (I'm copying them here
again):

   > system.time(RNAreads <- readBamGappedAlignmentPairs(file))
         user  system elapsed
      1852.16   59.00 1939.11
   > system.time(RNAreads <- readBamGappedAlignments(file))
         user  system elapsed
       192.80    8.12  214.97

Subtracting the 2 times above to infer the time spent for the pairing
is a shortcut that overlooks the following: when
readBamGappedAlignmentPairs() calls readBamGappedAlignments()
internally, it calls it with 'use.names=TRUE', and also with a 'param'
arg that specifies the extra BAM fields, and also with a specific flag
filter. All those things are required for the pairing algo. So a more
sensible comparison is to compare:

   RNAreads <- readBamGappedAlignmentPairs(file)

versus:

   flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, 
hasUnmappedMate=FALSE)
   what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos")
   param0 <- ScanBamParam(flag=flag0, what=what0)
   RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0)

This internal call to readBamGappedAlignments() takes more than 70%
of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
with BioC devel on my Linux server). This is a little bit surprising
to me and that's something I'd like to investigate.

Cheers,
H.


On 07/30/2013 05:00 PM, Dario Strbenac wrote:
> Hi.
>
> The files are
>
> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam
> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai
>
> While you have them, could you also investigate the granges coercion function ? It also took half an hour.
>
> ________________________________________
> From: Hervé Pagès [hpages at fhcrc.org]
> Sent: Tuesday, 30 July 2013 3:29 AM
> To: Dario Strbenac
> Cc: bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>
> On 07/29/2013 12:00 AM, Dario Strbenac wrote:
>> Because I only allowed unique mappings, the ratio is 2. After installing the development package versions, I only got a 10% improvement.
>
> Mmmh that's disappointing. Can you put the file somewhere so I can have
> a look? Thanks.
>
> H.
>
>>
>>      user  system elapsed
>> 1681.36   65.79 1752.10
>>
>> ________________________________________
>> From: Pages, Herve [hpages at fhcrc.org]
>> Sent: Monday, 29 July 2013 3:29 PM
>> To: Dario Strbenac
>> Cc: bioc-devel at r-project.org
>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>
>> Hi Dario,
>>
>> The function was optimized in Bioc-devel. Depending on the number
>> of records in your BAM file (which is more relevant than its size),
>> it should now be between 2x and 5x faster. Please give it a try and
>> let us know.
>>
>> Also keep in mind that the pairing algo will slow down if
>> the "average nb of records per unique QNAME" is 3 or more.
>> This was discussed here last month:
>>     https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html
>>
>> Cheers,
>> H.
>>
>>
>> ----- Original Message -----
>> From: "Dario Strbenac" <dstr7320 at uni.sydney.edu.au>
>> To: bioc-devel at r-project.org
>> Sent: Sunday, July 28, 2013 9:00:30 PM
>> Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>
>> Hello,
>>
>> Could readBamGappedAlignmentPairs be optimised ? It's surprising that it takes an extra 29 minutes to calculate the pairing information, for the example below. The BAM file is 4 GB in size.
>>
>>> system.time(RNAreads <- readBamGappedAlignmentPairs(file))
>>      user  system elapsed
>> 1852.16   59.00 1939.11
>>> system.time(RNAreads <- readBamGappedAlignments(file))
>>      user  system elapsed
>>    192.80    8.12  214.97
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list