[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()
Valerie Obenchain
vobencha at fhcrc.org
Thu Jul 3 18:40:24 CEST 2014
Correction below.
On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
> Hi,
>
> On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>> Hi Valerie,
>>
>> Thanks for the reply. I'll post some runtime as soon as it finishes
>> running the readGAlignments and readGAlignmentPairs.
>>
>> Thanks for the tip on passing range(allExons) into the which param.
>> There is only one issue in that if I wanted to then generate say RPKM
>> values where I need to know the "total number of aligned reads" in a bam
>> file. Because I've read in just a subset of the bam (i.e. the reads that
>> aligned just to exons) I would have technically produced an incorrection
>> calculation of RPKM values (assuming you normalize using the number of
>> aligned reads). On that note, is there a quick and easy way to just
>> count the number of aligned reads or aligned paired reads so I can
>> quickly normalize by this way. This way I can still use the which
>> parameter in readGAlignments and readGAlignmentPairs to speed it up?
>
> For single-end you can use countBam():
>
> > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE))
> > countBam(fl, param = param)
> space start end width file records nucleotides
> 1 NA NA NA NA ex1.bam 3271 115286
>
> It's not so clear for paired-end. The mate-pairing algorithm used in the
> readGAlignment* functions is described on the man page and uses a
> combination of BAM fields and flags. You can't really get at this simply
> by creating a param with a flag combination such as
>
> isPaired = TRUE
> isProperPair = TRUE
> isUnmappedQuery = FALSE
> hasUnmappedMate = FALSE
>
> For example, if you're interested in this region
>
> gr <- GRanges("seq1", IRanges(100, 500))
>
> and set flags,
>
> flags <- scanBamFlag(isPaired = TRUE,
> isProperPair = TRUE,
> isUnmappedQuery = FALSE,
> hasUnmappedMate = FALSE)
>
> countBam() reports 335 records. These are individual records so the mate
> pairs (parts) are each counted separately. Only records that fall within
> the region defined by 'gr' are counted.
>
> > param <- ScanBamParam(which = gr, flag = flags)
> > countBam(fl, param = param)
> space start end width file records nucleotides
> 1 seq1 100 500 401 ex1.bam 335 11785
>
>
> readGAlignmentPairs() reports 223 pairs (446 individual records). All
> records in 'gr' are mated if possible. In the case where one mate part
> falls inside 'gr' and one outside, the other mate will be retrieved
> (i.e., outside the range of 'gr'). This makes the 335 above confusing
> for counting the number of aligned pairs in the region. The number is a
> mixture of (1) mate parts both in 'gr' that belong together (2) mate
> parts with mate outside 'gr' and (3) mates or mate parts that don't meet
> other mate-pairing criteria.
>
> > galp <- readGAlignmentPairs(fl, param=param)
> > length(galp)
> [1] 223
>
> Off hand I can't think of a good way to count aligned pairs w/in a
> specified region.
What I should have said here is I can't think of a good way to
accurately count aligned pairs (1 count per pair) in a BAM file w/o
using the pairing algo.
Valerie
>
>>
>> Also another thing I notice is that when I use the
>> readGAlignmentPairs function, I lose my ability to cancel the command
>> (ctrl-c no longer works).
>
> It's implemented in C/C++. You can try ctrl+\ in the session or kill
> with the top command from another terminal. Maybe others have
> suggestions they'll add ...
>
> Valerie
>
> Not sure this is a bug, or if it is something
>> to do with the internal workings of the function. The loss of ctrl-c
>> also occurs when trying to run as a Rscript. I can seemingly ctrl-c
>> anytime before, but it hits this step it just refuses to cancel.
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>> Hi,
>>
>> You don't need to call findMateAlignment() when using
>> readGAlignmentPairs(). In a previous iteration,
>> readGAlignmentPairs() called findMateAlignment() but this is no
>> longer the case.
>>
>> readGAlignmentPairs() does have more overhead than readGAlignments()
>> because of the mate paring. These numbers are for a BAM file with
>> 800484 records (400054 final pairs) on my not-so-speedy laptop.
>>
>> library(microbenchmark)
>> library(RNAseqData.HNRNPC.bam.__chr14)
>> fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES
>>
>> microbenchmark(__readGAlignments(fls[1]), times=5)
>>
>> Unit: seconds
>> expr min lq median uq
>> max neval
>> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071
>> 2.121211 5
>>
>>
>>
>> ## mate pairing
>>
>> microbenchmark(__readGAlignmentPairs(fls[1]), times=5)
>>
>> Unit: seconds
>> expr min lq median
>> uq max neval
>> readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194
>> 9.843282 11.37441 5
>>
>>
>> ## mate pairing
>>
>> microbenchmark(__readGAlignmentsList(fls[1]), times=5)
>>
>> Unit: seconds
>> expr min lq median
>> uq max neval
>> readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696
>> 9.692845 9.867628 5
>>
>>
>>
>> You can set a 'yieldSize' when creating the BamFile. This will chunk
>> through the file 'yieldSize' records at a time and is useful when
>> processing a complete file and when there are memory limitations.
>>
>> bf <- BamFile(myfile, yieldSize = 1000000)
>>
>> Herve reported some additional timings for readGAlignmentPairs() on
>> this post:
>>
>> https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>> <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>
>>
>>
>> Can you provide some numbers for how long your run is taking and the
>> number of records you're trying to read in?
>>
>> I assume 'allExons' is used for counting against the BAM files in
>> later code? Maybe this is why your were looking into
>> summarizeOverlaps(). I should point out that if you're only
>> interested in the regions overlapping 'allExons' you could call
>> range(allExons) and use that as the 'which' in the param. This would
>> focus the area of interest and speed up the reading.
>>
>> Valerie
>>
>>
>>
>>
>> On 06/30/14 12:22, Fong Chun Chan wrote:
>>
>> Hi,
>>
>> I've been using the GenomicFeatures R package to read in some
>> RNA-seq
>> paired-end read data. While the readGAlignments() function reads
>> in the bam
>> file within a minute, I've noticed that the
>> readGAlignmentPairs() function
>> is extremely slow.
>>
>> This is even after restricting the space to just a single
>> chromosome along
>> with setting the flags isDuplicate = FALSE and isNotPrimaryRead
>> = FALSE
>> (the latter being suggested in the reference):
>>
>> "An easy way to avoid this degradation is to load only primary
>> alignments
>> by setting the isNotPrimaryRead flag to FALSE in ScanBamParam()"
>>
>> Based on my understanding, it seems that the ï¬
>> ndMateAlignment() function
>>
>> has to be run first. Is there any other way to speed it up?
>> Perhaps I am
>> doing something wrong here?
>>
>> Code and sessionInfo() below.
>>
>> Thanks!
>>
>> ----
>> library("GenomicFeatures")
>> library('GenomicAlignments')
>> library('Rsamtools')
>> si <- seqinfo(BamFile(arguments$__bamFile))
>> gr <- GRanges(arguments$chr, IRanges(100,
>> seqlengths(si)[[arguments$chr]__]-100))
>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id',
>> 'exon_name'))
>> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead =
>> FALSE ) #
>> remove duplicate reads, remove non-primary reads (i.e.
>> multi-aligned reads)
>> reads <- readGAlignmentPairs( arguments$bamFile, param =
>> ScanBamParam(
>> which = gr, flag = scf ) ); # grab reads in specific region
>>
>> sessionInfo()
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> LC_PAPER=en_US.UTF-8
>> LC_NAME=C LC_ADDRESS=C
>> LC_TELEPHONE=C
>> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
>> methods
>> base
>>
>> other attached packages:
>> [1] argparse_1.0.1 proto_0.3-10
>> GenomicAlignments_1.0.1 BSgenome_1.32.0
>> Rsamtools_1.16.1
>> Biostrings_2.32.0 XVector_0.4.0
>> GenomicFeatures_1.16.2
>> AnnotationDbi_1.26.0 Biobase_2.24.0
>> GenomicRanges_1.16.3
>> GenomeInfoDb_1.0.2 IRanges_1.22.9
>> BiocGenerics_0.10.0
>> vimcom.plus_0.9-93
>> [16] setwidth_1.0-3 colorout_1.0-2
>>
>> loaded via a namespace (and not attached):
>> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1
>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6
>> checkmate_1.0
>> codetools_0.2-8 DBI_0.2-7 digest_0.6.4
>> fail_1.2
>> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0
>> iterators_1.0.7
>> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
>> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2
>> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2
>> tools_3.1.0
>> XML_3.98-1.1 zlibbioc_1.10.0
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>> --
>> Valerie Obenchain
>> Program in Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, Seattle, WA 98109
>>
>> Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list