[Bioc-devel] Running Time of readBamGappedAlignmentPairs

Wed Aug 7 05:00:15 CEST 2013

Thanks. There is an improvement with those options. Not as much as the character creation example, but it does help.

   user  system elapsed 
1048.42   71.16 1120.75 

________________________________________
From: Martin Morgan [mtmorgan at fhcrc.org]
Sent: Friday, 2 August 2013 9:24 PM
To: Hervé Pagès
Cc: Dario Strbenac; bioc-devel at r-project.org
Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs

On 07/31/2013 04:05 PM, Hervé Pagès wrote:
> Hi Dario,
>
> Thanks for providing the file.
>
> The file has 55780092 records in it and yes it only contains primary
> alignments so this is the best situation for the pairing algorithm.
>
> Here are the timings I get for the readBamGappedAlignmentPairs()
> function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
> Ubuntu server with 384 Gb of RAM:
>
> ## With BioC release (Rsamtools 1.12.3)
> ## ------------------------------------
> ## Timings:
> ##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
> ##     - scanBam:                6 min      (A1)

I haven't looked at Dario's data directly, but a surprising bottleneck is the
creation of large character vectors

$ R --vanilla
[...]
 >  i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  74.880   0.408  75.451

This can be alleviated by telling R that that you're likely to need lots of
memory up front; I alias R so that

$ R --min-vsize=2048M --min-nsize=20M --vanilla
[...]
 > i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  25.269   0.472  25.796

or

$ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
[...]
 > i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  12.196   0.464  12.687

These are documented on ?Memory and with

 > R.version.string
[1] "R version 3.0.1 Patched (2013-06-05 r62876)"

Martin

> ##     - other:                  4 min 30 s (A2)
> ##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
> ##     - findMateAlignment:      1 min 30 s (B1)
> ##     - other:                  4 min      (B2)
> ##   Total time:                16 min      (A+B)
> ## Memory usage: 13 Gb
>
> ## With BioC devel (Rsamtools 1.13.19)
> ## -----------------------------------
> ## Timings:
> ##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
> ##     - scanBam:               6 min      (A1)
> ##     - other:                 5 min      (A2)
> ##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
> ##     - findMateAlignment:     1 min      (B1)
> ##     - other:                 3 min      (B2)
> ##   Total time:               15 min      (A+B)
> ## Memory usage: 13 Gb
>
> Indeed, not much difference between release and devel :-/
>
> I didn't have the opportunity to do this kind of detailed timing of
> the readGAlignmentPairsFromBam() function on such a big file before,
> but I find it pretty interesting (I only benchmarked findMateAlignment()
> and it was on simulated data).
>
> In particular I'm quite surprised to discover that the pairing algo
> (implemented in findMateAlignment) is actually not the bottleneck
> (and it being optimized in BioC devel explains the small difference
> between total time in release and total time in devel).
>
> Just a note on the timings you sent earlier (I'm copying them here
> again):
>
>    > system.time(RNAreads <- readBamGappedAlignmentPairs(file))
>          user  system elapsed
>       1852.16   59.00 1939.11
>    > system.time(RNAreads <- readBamGappedAlignments(file))
>          user  system elapsed
>        192.80    8.12  214.97
>
> Subtracting the 2 times above to infer the time spent for the pairing
> is a shortcut that overlooks the following: when
> readBamGappedAlignmentPairs() calls readBamGappedAlignments()
> internally, it calls it with 'use.names=TRUE', and also with a 'param'
> arg that specifies the extra BAM fields, and also with a specific flag
> filter. All those things are required for the pairing algo. So a more
> sensible comparison is to compare:
>
>    RNAreads <- readBamGappedAlignmentPairs(file)
>
> versus:
>
>    flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
> hasUnmappedMate=FALSE)
>    what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos")
>    param0 <- ScanBamParam(flag=flag0, what=what0)
>    RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0)
>
> This internal call to readBamGappedAlignments() takes more than 70%
> of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
> with BioC devel on my Linux server). This is a little bit surprising
> to me and that's something I'd like to investigate.
>
> Cheers,
> H.
>
>
> On 07/30/2013 05:00 PM, Dario Strbenac wrote:
>> Hi.
>>
>> The files are
>>
>> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam
>> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai
>>
>> While you have them, could you also investigate the granges coercion function
>> ? It also took half an hour.
>>
>> ________________________________________
>> From: Hervé Pagès [hpages at fhcrc.org]
>> Sent: Tuesday, 30 July 2013 3:29 AM
>> To: Dario Strbenac
>> Cc: bioc-devel at r-project.org
>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>
>> On 07/29/2013 12:00 AM, Dario Strbenac wrote:
>>> Because I only allowed unique mappings, the ratio is 2. After installing the
>>> development package versions, I only got a 10% improvement.
>>
>> Mmmh that's disappointing. Can you put the file somewhere so I can have
>> a look? Thanks.
>>
>> H.
>>
>>>
>>>      user  system elapsed
>>> 1681.36   65.79 1752.10
>>>
>>> ________________________________________
>>> From: Pages, Herve [hpages at fhcrc.org]
>>> Sent: Monday, 29 July 2013 3:29 PM
>>> To: Dario Strbenac
>>> Cc: bioc-devel at r-project.org
>>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>>
>>> Hi Dario,
>>>
>>> The function was optimized in Bioc-devel. Depending on the number
>>> of records in your BAM file (which is more relevant than its size),
>>> it should now be between 2x and 5x faster. Please give it a try and
>>> let us know.
>>>
>>> Also keep in mind that the pairing algo will slow down if
>>> the "average nb of records per unique QNAME" is 3 or more.
>>> This was discussed here last month:
>>>     https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Dario Strbenac" <dstr7320 at uni.sydney.edu.au>
>>> To: bioc-devel at r-project.org
>>> Sent: Sunday, July 28, 2013 9:00:30 PM
>>> Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>>
>>> Hello,
>>>
>>> Could readBamGappedAlignmentPairs be optimised ? It's surprising that it
>>> takes an extra 29 minutes to calculate the pairing information, for the
>>> example below. The BAM file is 4 GB in size.
>>>
>>>> system.time(RNAreads <- readBamGappedAlignmentPairs(file))
>>>      user  system elapsed
>>> 1852.16   59.00 1939.11
>>>> system.time(RNAreads <- readBamGappedAlignments(file))
>>>      user  system elapsed
>>>    192.80    8.12  214.97
>>>
>>> --------------------------------------
>>> Dario Strbenac
>>> PhD Student
>>> University of Sydney
>>> Camperdown NSW 2050
>>> Australia
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793