[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Thu Jul 3 21:11:18 CEST 2014

fyi the test files ('fls') came from,

library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

Valerie


On 07/03/2014 10:42 AM, Valerie Obenchain wrote:
> No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved
> file. Files with more diversity may have more than 2 segments per template.
>
> counts <- lapply(fls, function(fl)
>    countBam(fl, param=param)$records)
>
> pairs <- lapply(fls, function(fl)
>    length(readGAlignmentPairs(fl, param=param)))
>
>>> unlist(counts)/2
>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>> ERR127305
>>    363326    392420    397158    349970    355376    381493
>> 389176    355834
>>> unlist(pairs)
>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>> ERR127305
>>    363147    392252    396933    349822    355215    381274
>> 388991    355682
>
> To see what's going on use readGAlignementsList() on the first file.
> This function adds some flexibility in that it reads in fragments, pairs
> w/ no mates, etc.
>
> galist <- readGAlignmentsList(fl, param=param)
>
> The length matches that of countBam() because it reads all records.
>
>>> length(galist)
>> [1] 363226
>
> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'.
> Here we see the 363147 match to output from readGAlignmentsPairs() and
> the rest are ambiguous.
>
>>> table(mcols(galist)$mate_status)
>>
>>     mated ambiguous   unmated
>>    363147        79         0
>
> elementLengths() shows the number of segments per template. Again we see
> the 363147 match and it shows exactly 2 segments for those.
>
>>> table(elementLengths(galist))
>>
>>      2      4      6      8
>> 363147     65      7      7
>
>
> Valerie
>
>
> On 07/03/2014 10:08 AM, Fong Chun Chan wrote:
>> Thanks for the reply. I understand what you mean about counting aligned
>> pairs within a region and how that would give incorrect results.
>>
>> However, if I was purely interested in just getting all aligned
>> read-pairs within the whole genomic space (no restriction to any region
>> since this is what I need to normalize by), wouldn't the following code
>> work?
>>
>> library('Rsamtools')
>> library('GenomicAlignments')
>>
>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>>
>> flags <- scanBamFlag(isPaired = TRUE,
>>                       isProperPair = TRUE,
>>                       isUnmappedQuery = FALSE,
>>                       hasUnmappedMate = FALSE)
>>
>> param <- ScanBamParam(flag = flags)
>>
>> countBam(fl, param = param)[['records']]/2
>> [1]1572
>>
>> galp <- readGAlignmentPairs(fl, param=param)
>> length(galp)
>> [1]1572
>>
>>
>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>>     Correction below.
>>
>>
>>     On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
>>
>>         Hi,
>>
>>         On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>>
>>             Hi Valerie,
>>
>>             Thanks for the reply. I'll post some runtime as soon as it
>>             finishes
>>             running the readGAlignments and readGAlignmentPairs.
>>
>>             Thanks for the tip on passing range(allExons) into the which
>>             param.
>>             There is only one issue in that if I wanted to then generate
>>             say RPKM
>>             values where I need to know the "total number of aligned
>>             reads" in a bam
>>             file. Because I've read in just a subset of the bam (i.e.
>>             the reads that
>>             aligned just to exons) I would have technically produced an
>>             incorrection
>>             calculation of RPKM values (assuming you normalize using the
>>             number of
>>             aligned reads). On that note, is there a quick and easy way
>>             to just
>>             count the number of aligned reads or aligned paired reads so
>>             I can
>>             quickly normalize by this way. This way I can still use the
>>             which
>>             parameter in readGAlignments and readGAlignmentPairs to
>>             speed it up?
>>
>>
>>         For single-end you can use countBam():
>>
>>           > fl <- system.file("extdata", "ex1.bam", package =
>> "Rsamtools")
>>           > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery =
>>         FALSE))
>>           > countBam(fl, param = param)
>>             space start end width    file records nucleotides
>>         1    NA    NA  NA    NA ex1.bam    3271      115286
>>
>>         It's not so clear for paired-end. The mate-pairing algorithm
>>         used in the
>>         readGAlignment* functions is described on the man page and uses a
>>         combination of BAM fields and flags. You can't really get at
>>         this simply
>>         by creating a param with a flag combination such as
>>
>>         isPaired = TRUE
>>         isProperPair = TRUE
>>         isUnmappedQuery = FALSE
>>         hasUnmappedMate = FALSE
>>
>>         For example, if you're interested in this region
>>
>>         gr <- GRanges("seq1", IRanges(100, 500))
>>
>>         and set flags,
>>
>>         flags <- scanBamFlag(isPaired = TRUE,
>>                                isProperPair = TRUE,
>>                                isUnmappedQuery = FALSE,
>>                                hasUnmappedMate = FALSE)
>>
>>         countBam() reports 335 records. These are individual records so
>>         the mate
>>         pairs (parts) are each counted separately. Only records that
>>         fall within
>>         the region defined by 'gr' are counted.
>>
>>           > param <- ScanBamParam(which = gr, flag = flags)
>>           > countBam(fl, param = param)
>>             space start end width    file records nucleotides
>>         1  seq1   100 500   401 ex1.bam     335       11785
>>
>>
>>         readGAlignmentPairs() reports 223 pairs (446 individual
>>         records). All
>>         records in 'gr' are mated if possible. In the case where one
>>         mate part
>>         falls inside 'gr' and one outside, the other mate will be
>> retrieved
>>         (i.e., outside the range of 'gr'). This makes the 335 above
>>         confusing
>>         for counting the number of aligned pairs in the region. The
>>         number is a
>>         mixture of (1) mate parts both in 'gr' that belong together
>> (2) mate
>>         parts with mate outside 'gr' and (3) mates or mate parts that
>>         don't meet
>>         other mate-pairing criteria.
>>
>>           > galp <- readGAlignmentPairs(fl, param=param)
>>           > length(galp)
>>         [1] 223
>>
>>         Off hand I can't think of a good way to count aligned pairs
>> w/in a
>>         specified region.
>>
>>
>>     What I should have said here is I can't think of a good way to
>>     accurately count aligned pairs (1 count per pair) in a BAM file w/o
>>     using the pairing algo.
>>
>>     Valerie
>>
>>
>>
>>             Also another thing I notice is that when I use the
>>             readGAlignmentPairs function, I lose my ability to cancel
>>             the command
>>             (ctrl-c no longer works).
>>
>>
>>         It's implemented in C/C++. You can try ctrl+\ in the session
>> or kill
>>         with the top command from another terminal. Maybe others have
>>         suggestions they'll add ...
>>
>>         Valerie
>>
>>         Not sure this is a bug, or if it is something
>>
>>             to do with the internal workings of the function. The loss
>>             of ctrl-c
>>             also occurs when trying to run as a Rscript. I can seemingly
>>             ctrl-c
>>             anytime before, but it hits this step it just refuses to
>> cancel.
>>
>>             Thanks,
>>
>>
>>
>>
>>
>>
>>             On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain
>>             <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>             <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>
>> wrote:
>>
>>                  Hi,
>>
>>                  You don't need to call findMateAlignment() when using
>>                  readGAlignmentPairs(). In a previous iteration,
>>                  readGAlignmentPairs() called findMateAlignment() but
>>             this is no
>>                  longer the case.
>>
>>                  readGAlignmentPairs() does have more overhead than
>>             readGAlignments()
>>                  because of the mate paring. These numbers are for a BAM
>>             file with
>>                  800484 records (400054 final pairs) on my not-so-speedy
>>             laptop.
>>
>>                  library(microbenchmark)
>>                  library(RNAseqData.HNRNPC.bam.____chr14)
>>                  fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES
>>
>>                            microbenchmark(____readGAlignments(fls[1]),
>>             times=5)
>>
>>                      Unit: seconds
>>                                           expr      min       lq  median
>>                    uq
>>                        max neval
>>                        readGAlignments(fls[1]) 1.753653 1.788292 1.85527
>>             2.019071
>>                      2.121211     5
>>
>>
>>
>>                  ## mate pairing
>>
>>
>>             microbenchmark(____readGAlignmentPairs(fls[1]), times=5)
>>
>>                      Unit: seconds
>>                                               expr      min       lq
>>             median
>>                      uq      max neval
>>                        readGAlignmentPairs(fls[1]) 9.283966 9.300328
>>             9.535194
>>                      9.843282 11.37441     5
>>
>>
>>                  ## mate pairing
>>
>>
>>             microbenchmark(____readGAlignmentsList(fls[1]), times=5)
>>
>>                      Unit: seconds
>>                                               expr      min       lq
>>             median
>>                      uq      max neval
>>                        readGAlignmentsList(fls[1]) 8.409743 8.457232
>>             8.803696
>>                      9.692845 9.867628     5
>>
>>
>>
>>                  You can set a 'yieldSize' when creating the BamFile.
>>             This will chunk
>>                  through the file 'yieldSize' records at a time and is
>>             useful when
>>                  processing a complete file and when there are memory
>>             limitations.
>>
>>                     bf <- BamFile(myfile, yieldSize = 1000000)
>>
>>                  Herve reported some additional timings for
>>             readGAlignmentPairs() on
>>                  this post:
>>
>>
>> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/059695.html
>>
>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html>
>>
>>
>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>>
>> <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>>
>>
>>
>>                  Can you provide some numbers for how long your run is
>>             taking and the
>>                  number of records you're trying to read in?
>>
>>                  I assume 'allExons' is used for counting against the
>>             BAM files in
>>                  later code? Maybe this is why your were looking into
>>                  summarizeOverlaps(). I should point out that if
>> you're only
>>                  interested in the regions overlapping 'allExons' you
>>             could call
>>                  range(allExons) and use that as the 'which' in the
>>             param. This would
>>                  focus the area of interest and speed up the reading.
>>
>>                  Valerie
>>
>>
>>
>>
>>                  On 06/30/14 12:22, Fong Chun Chan wrote:
>>
>>                      Hi,
>>
>>                      I've been using the GenomicFeatures R package to
>>             read in some
>>                      RNA-seq
>>                      paired-end read data. While the readGAlignments()
>>             function reads
>>                      in the bam
>>                      file within a minute, I've noticed that the
>>                      readGAlignmentPairs() function
>>                      is extremely slow.
>>
>>                      This is even after restricting the space to just a
>>             single
>>                      chromosome along
>>                      with setting the flags isDuplicate = FALSE and
>>             isNotPrimaryRead
>>                      = FALSE
>>                      (the latter being suggested in the reference):
>>
>>                      "An easy way to avoid this degradation is to load
>>             only primary
>>                      alignments
>>                      by setting the isNotPrimaryRead ï¬‚ag to FALSE in
>>             ScanBamParam()"
>>
>>                      Based on my understanding, it seems that the ï¬
>>                      ndMateAlignment() function
>>
>>                      has to be run first. Is there any other way to
>>             speed it up?
>>                      Perhaps I am
>>                      doing something wrong here?
>>
>>                      Code and sessionInfo() below.
>>
>>                      Thanks!
>>
>>                      ----
>>                      library("GenomicFeatures")
>>                      library('GenomicAlignments')
>>                      library('Rsamtools')
>>                      si <- seqinfo(BamFile(arguments$____bamFile))
>>                      gr <- GRanges(arguments$chr, IRanges(100,
>>                      seqlengths(si)[[arguments$chr]____]-100))
>>                      allExons <- exons(txdb, columns = c('gene_id',
>>             'exon_id',
>>                      'exon_name'))
>>                      scf <- scanBamFlag( isDuplicate = FALSE,
>>             isNotPrimaryRead =
>>                      FALSE ) #
>>                      remove duplicate reads, remove non-primary reads
>> (i.e.
>>                      multi-aligned reads)
>>                      reads <- readGAlignmentPairs( arguments$bamFile,
>>             param =
>>                      ScanBamParam(
>>                      which = gr, flag = scf ) ); # grab reads in
>>             specific region
>>
>>                          sessionInfo()
>>
>>                      R version 3.1.0 (2014-04-10)
>>                      Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>                      locale:
>>                         [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>                      LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>                      LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>                        LC_PAPER=en_US.UTF-8
>>                              LC_NAME=C                  LC_ADDRESS=C
>>                      LC_TELEPHONE=C
>>                                    LC_MEASUREMENT=en_US.UTF-8
>>             LC_IDENTIFICATION=C
>>
>>                      attached base packages:
>>                      [1] parallel  stats     graphics  grDevices utils
>>                datasets
>>                        methods
>>                      base
>>
>>                      other attached packages:
>>                         [1] argparse_1.0.1          proto_0.3-10
>>                         GenomicAlignments_1.0.1 BSgenome_1.32.0
>>             Rsamtools_1.16.1
>>                         Biostrings_2.32.0       XVector_0.4.0
>>                      GenomicFeatures_1.16.2
>>                         AnnotationDbi_1.26.0    Biobase_2.24.0
>>                        GenomicRanges_1.16.3
>>                         GenomeInfoDb_1.0.2      IRanges_1.22.9
>>                        BiocGenerics_0.10.0
>>                      vimcom.plus_0.9-93
>>                      [16] setwidth_1.0-3          colorout_1.0-2
>>
>>                      loaded via a namespace (and not attached):
>>                         [1] BatchJobs_1.2      BBmisc_1.7
>>             BiocParallel_0.6.1
>>                      biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>>                      checkmate_1.0
>>                         codetools_0.2-8    DBI_0.2-7
>> digest_0.6.4
>>             fail_1.2
>>                          findpython_1.0.1   foreach_1.4.2
>> getopt_1.20.0
>>                        iterators_1.0.7
>>                           plyr_1.8.1         Rcpp_0.11.2
>>               RCurl_1.95-4.1
>>                      [19] rjson_0.2.14       RSQLite_0.11.4
>>             rtracklayer_1.24.2
>>                      sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2
>>             tools_3.1.0
>>                         XML_3.98-1.1       zlibbioc_1.10.0
>>
>>
>>
>>                               [[alternative HTML version deleted]]
>>
>>
>>
>>                      ___________________________________________________
>>                      Bioconductor mailing list
>>             Bioconductor at r-project.org
>>             <mailto:Bioconductor at r-project.org>
>>             <mailto:Bioconductor at r-__project.org
>>             <mailto:Bioconductor at r-project.org>>
>>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>>
>>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>>                      Search the archives:
>>
>>
>> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>>
>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>>
>>
>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>>
>>
>>
>>                  --
>>                  Valerie Obenchain
>>                  Program in Computational Biology
>>                  Fred Hutchinson Cancer Research Center
>>                  1100 Fairview Ave. N, Seattle, WA 98109
>>
>>                  Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>             <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>>                  Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>>             <tel:%28206%29%20667-3158>
>>
>>
>>
>>         _________________________________________________
>>         Bioconductor mailing list
>>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>         Search the archives:
>>
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor