[Bioc-devel] Running Time of readBamGappedAlignmentPairs

Fri Aug 2 19:53:11 CEST 2013

On 08/02/2013 07:33 AM, Michael Lawrence wrote:
> This has come up several times I think, and I'm totally naive, but what about
> making R smart enough to do that itself? One nice thing about the vectorization
> is that R knows how many CHARSXPs it needs to allocate up-front, or at least an
> upper-bound when they're all unique. Is the issue that the R heap needs to be
> initialized in a special way and its behavior can't change at run-time? It's a
> tricky thing to get right, I'm sure.

Currently, R allocates space for a STRSXP containing n = 55780092/2 elements, 
but for all R knows these could all be identical and no CHARSXP's are allocated; 
I don't think there's a way to pre-allocate n CHARSXPs. As the elements are 
filled, R sometimes needs to do garbage collection, and it has the increasingly 
onerous task of checking each CHARSXP to see whether it is in use (I think -- it 
seems like an in-use STRSXP might somehow be used to short-circuit this?). Maybe 
it would be helpful to be able to pre-allocate space for n unique CHARSXPs.

It seems like R_GC_MEM_GROW could be made accessible via C calls, but presumably 
this is not a good thing for user code to be entrusted with.

Martin

>
>
> On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>     On 07/31/2013 04:05 PM, Hervé Pagès wrote:
>
>         Hi Dario,
>
>         Thanks for providing the file.
>
>         The file has 55780092 records in it and yes it only contains primary
>         alignments so this is the best situation for the pairing algorithm.
>
>         Here are the timings I get for the readBamGappedAlignmentPairs()
>         function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
>         Ubuntu server with 384 Gb of RAM:
>
>         ## With BioC release (Rsamtools 1.12.3)
>         ## ------------------------------__------
>         ## Timings:
>         ##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
>         ##     - scanBam:                6 min      (A1)
>
>
>     I haven't looked at Dario's data directly, but a surprising bottleneck is
>     the creation of large character vectors
>
>     $ R --vanilla
>     [...]
>      >  i = seq_len(55780092/2)
>      > system.time(as.character(i))
>         user  system elapsed
>       74.880   0.408  75.451
>
>     This can be alleviated by telling R that that you're likely to need lots of
>     memory up front; I alias R so that
>
>     $ R --min-vsize=2048M --min-nsize=20M --vanilla
>     [...]
>      > i = seq_len(55780092/2)
>      > system.time(as.character(i))
>         user  system elapsed
>       25.269   0.472  25.796
>
>     or
>
>     $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
>     [...]
>      > i = seq_len(55780092/2)
>      > system.time(as.character(i))
>         user  system elapsed
>       12.196   0.464  12.687
>
>     These are documented on ?Memory and with
>
>      > R.version.string
>     [1] "R version 3.0.1 Patched (2013-06-05 r62876)"
>
>
>     Martin
>
>
>         ##     - other:                  4 min 30 s (A2)
>         ##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
>         ##     - findMateAlignment:      1 min 30 s (B1)
>         ##     - other:                  4 min      (B2)
>         ##   Total time:                16 min      (A+B)
>         ## Memory usage: 13 Gb
>
>         ## With BioC devel (Rsamtools 1.13.19)
>         ## ------------------------------__-----
>         ## Timings:
>         ##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
>         ##     - scanBam:               6 min      (A1)
>         ##     - other:                 5 min      (A2)
>         ##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
>         ##     - findMateAlignment:     1 min      (B1)
>         ##     - other:                 3 min      (B2)
>         ##   Total time:               15 min      (A+B)
>         ## Memory usage: 13 Gb
>
>         Indeed, not much difference between release and devel :-/
>
>         I didn't have the opportunity to do this kind of detailed timing of
>         the readGAlignmentPairsFromBam() function on such a big file before,
>         but I find it pretty interesting (I only benchmarked findMateAlignment()
>         and it was on simulated data).
>
>         In particular I'm quite surprised to discover that the pairing algo
>         (implemented in findMateAlignment) is actually not the bottleneck
>         (and it being optimized in BioC devel explains the small difference
>         between total time in release and total time in devel).
>
>         Just a note on the timings you sent earlier (I'm copying them here
>         again):
>
>             > system.time(RNAreads <- readBamGappedAlignmentPairs(__file))
>                   user  system elapsed
>                1852.16   59.00 1939.11
>             > system.time(RNAreads <- readBamGappedAlignments(file))
>                   user  system elapsed
>                 192.80    8.12  214.97
>
>         Subtracting the 2 times above to infer the time spent for the pairing
>         is a shortcut that overlooks the following: when
>         readBamGappedAlignmentPairs() calls readBamGappedAlignments()
>         internally, it calls it with 'use.names=TRUE', and also with a 'param'
>         arg that specifies the extra BAM fields, and also with a specific flag
>         filter. All those things are required for the pairing algo. So a more
>         sensible comparison is to compare:
>
>             RNAreads <- readBamGappedAlignmentPairs(__file)
>
>         versus:
>
>             flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
>         hasUnmappedMate=FALSE)
>             what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos")
>             param0 <- ScanBamParam(flag=flag0, what=what0)
>             RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0)
>
>         This internal call to readBamGappedAlignments() takes more than 70%
>         of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
>         with BioC devel on my Linux server). This is a little bit surprising
>         to me and that's something I'd like to investigate.
>
>         Cheers,
>         H.
>
>
>         On 07/30/2013 05:00 PM, Dario Strbenac wrote:
>
>             Hi.
>
>             The files are
>
>             http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam
>             <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam>
>             http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam.__bai
>             <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai>
>
>             While you have them, could you also investigate the granges coercion
>             function
>             ? It also took half an hour.
>
>             __________________________________________
>             From: Hervé Pagès [hpages at fhcrc.org <mailto:hpages at fhcrc.org>]
>             Sent: Tuesday, 30 July 2013 3:29 AM
>             To: Dario Strbenac
>             Cc: bioc-devel at r-project.org <mailto:bioc-devel at r-project.org>
>             Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>
>             On 07/29/2013 12:00 AM, Dario Strbenac wrote:
>
>                 Because I only allowed unique mappings, the ratio is 2. After
>                 installing the
>                 development package versions, I only got a 10% improvement.
>
>
>             Mmmh that's disappointing. Can you put the file somewhere so I can have
>             a look? Thanks.
>
>             H.
>
>
>                       user  system elapsed
>                 1681.36 65.79 1752.10 <tel:65.79%201752.10>
>
>                 __________________________________________
>                 From: Pages, Herve [hpages at fhcrc.org <mailto:hpages at fhcrc.org>]
>                 Sent: Monday, 29 July 2013 3:29 PM
>                 To: Dario Strbenac
>                 Cc: bioc-devel at r-project.org <mailto:bioc-devel at r-project.org>
>                 Subject: Re: [Bioc-devel] Running Time of
>                 readBamGappedAlignmentPairs
>
>                 Hi Dario,
>
>                 The function was optimized in Bioc-devel. Depending on the number
>                 of records in your BAM file (which is more relevant than its size),
>                 it should now be between 2x and 5x faster. Please give it a try and
>                 let us know.
>
>                 Also keep in mind that the pairing algo will slow down if
>                 the "average nb of records per unique QNAME" is 3 or more.
>                 This was discussed here last month:
>                 https://stat.ethz.ch/__pipermail/bioconductor/2013-__June/053390.html
>                 <https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html>
>
>                 Cheers,
>                 H.
>
>
>                 ----- Original Message -----
>                 From: "Dario Strbenac" <dstr7320 at uni.sydney.edu.au
>                 <mailto:dstr7320 at uni.sydney.edu.au>>
>                 To: bioc-devel at r-project.org <mailto:bioc-devel at r-project.org>
>                 Sent: Sunday, July 28, 2013 9:00:30 PM
>                 Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>
>                 Hello,
>
>                 Could readBamGappedAlignmentPairs be optimised ? It's surprising
>                 that it
>                 takes an extra 29 minutes to calculate the pairing information,
>                 for the
>                 example below. The BAM file is 4 GB in size.
>
>                     system.time(RNAreads <- readBamGappedAlignmentPairs(__file))
>
>                       user  system elapsed
>                 1852.16   59.00 1939.11
>
>                     system.time(RNAreads <- readBamGappedAlignments(file))
>
>                       user  system elapsed
>                     192.80    8.12  214.97
>
>                 ------------------------------__--------
>                 Dario Strbenac
>                 PhD Student
>                 University of Sydney
>                 Camperdown NSW 2050
>                 Australia
>
>                 _________________________________________________
>                 Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>                 mailing list
>                 https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                 <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>                 _________________________________________________
>                 Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>                 mailing list
>                 https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                 <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>             --
>             Hervé Pagès
>
>             Program in Computational Biology
>             Division of Public Health Sciences
>             Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N, M1-B514
>             P.O. Box 19024
>             Seattle, WA 98109-1024
>
>             E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>             Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>
>     --
>     Computational Biology / Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>     Location: Arnold Building M1 B861
>     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>
>     _________________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793