[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()
Valerie Obenchain
vobencha at fhcrc.org
Thu Jul 3 21:11:18 CEST 2014
fyi the test files ('fls') came from,
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
Valerie
On 07/03/2014 10:42 AM, Valerie Obenchain wrote:
> No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved
> file. Files with more diversity may have more than 2 segments per template.
>
> counts <- lapply(fls, function(fl)
> countBam(fl, param=param)$records)
>
> pairs <- lapply(fls, function(fl)
> length(readGAlignmentPairs(fl, param=param)))
>
>>> unlist(counts)/2
>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>> ERR127305
>> 363326 392420 397158 349970 355376 381493
>> 389176 355834
>>> unlist(pairs)
>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>> ERR127305
>> 363147 392252 396933 349822 355215 381274
>> 388991 355682
>
> To see what's going on use readGAlignementsList() on the first file.
> This function adds some flexibility in that it reads in fragments, pairs
> w/ no mates, etc.
>
> galist <- readGAlignmentsList(fl, param=param)
>
> The length matches that of countBam() because it reads all records.
>
>>> length(galist)
>> [1] 363226
>
> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'.
> Here we see the 363147 match to output from readGAlignmentsPairs() and
> the rest are ambiguous.
>
>>> table(mcols(galist)$mate_status)
>>
>> mated ambiguous unmated
>> 363147 79 0
>
> elementLengths() shows the number of segments per template. Again we see
> the 363147 match and it shows exactly 2 segments for those.
>
>>> table(elementLengths(galist))
>>
>> 2 4 6 8
>> 363147 65 7 7
>
>
> Valerie
>
>
> On 07/03/2014 10:08 AM, Fong Chun Chan wrote:
>> Thanks for the reply. I understand what you mean about counting aligned
>> pairs within a region and how that would give incorrect results.
>>
>> However, if I was purely interested in just getting all aligned
>> read-pairs within the whole genomic space (no restriction to any region
>> since this is what I need to normalize by), wouldn't the following code
>> work?
>>
>> library('Rsamtools')
>> library('GenomicAlignments')
>>
>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>>
>> flags <- scanBamFlag(isPaired = TRUE,
>> isProperPair = TRUE,
>> isUnmappedQuery = FALSE,
>> hasUnmappedMate = FALSE)
>>
>> param <- ScanBamParam(flag = flags)
>>
>> countBam(fl, param = param)[['records']]/2
>> [1]1572
>>
>> galp <- readGAlignmentPairs(fl, param=param)
>> length(galp)
>> [1]1572
>>
>>
>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>> Correction below.
>>
>>
>> On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
>>
>> Hi,
>>
>> On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>>
>> Hi Valerie,
>>
>> Thanks for the reply. I'll post some runtime as soon as it
>> finishes
>> running the readGAlignments and readGAlignmentPairs.
>>
>> Thanks for the tip on passing range(allExons) into the which
>> param.
>> There is only one issue in that if I wanted to then generate
>> say RPKM
>> values where I need to know the "total number of aligned
>> reads" in a bam
>> file. Because I've read in just a subset of the bam (i.e.
>> the reads that
>> aligned just to exons) I would have technically produced an
>> incorrection
>> calculation of RPKM values (assuming you normalize using the
>> number of
>> aligned reads). On that note, is there a quick and easy way
>> to just
>> count the number of aligned reads or aligned paired reads so
>> I can
>> quickly normalize by this way. This way I can still use the
>> which
>> parameter in readGAlignments and readGAlignmentPairs to
>> speed it up?
>>
>>
>> For single-end you can use countBam():
>>
>> > fl <- system.file("extdata", "ex1.bam", package =
>> "Rsamtools")
>> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery =
>> FALSE))
>> > countBam(fl, param = param)
>> space start end width file records nucleotides
>> 1 NA NA NA NA ex1.bam 3271 115286
>>
>> It's not so clear for paired-end. The mate-pairing algorithm
>> used in the
>> readGAlignment* functions is described on the man page and uses a
>> combination of BAM fields and flags. You can't really get at
>> this simply
>> by creating a param with a flag combination such as
>>
>> isPaired = TRUE
>> isProperPair = TRUE
>> isUnmappedQuery = FALSE
>> hasUnmappedMate = FALSE
>>
>> For example, if you're interested in this region
>>
>> gr <- GRanges("seq1", IRanges(100, 500))
>>
>> and set flags,
>>
>> flags <- scanBamFlag(isPaired = TRUE,
>> isProperPair = TRUE,
>> isUnmappedQuery = FALSE,
>> hasUnmappedMate = FALSE)
>>
>> countBam() reports 335 records. These are individual records so
>> the mate
>> pairs (parts) are each counted separately. Only records that
>> fall within
>> the region defined by 'gr' are counted.
>>
>> > param <- ScanBamParam(which = gr, flag = flags)
>> > countBam(fl, param = param)
>> space start end width file records nucleotides
>> 1 seq1 100 500 401 ex1.bam 335 11785
>>
>>
>> readGAlignmentPairs() reports 223 pairs (446 individual
>> records). All
>> records in 'gr' are mated if possible. In the case where one
>> mate part
>> falls inside 'gr' and one outside, the other mate will be
>> retrieved
>> (i.e., outside the range of 'gr'). This makes the 335 above
>> confusing
>> for counting the number of aligned pairs in the region. The
>> number is a
>> mixture of (1) mate parts both in 'gr' that belong together
>> (2) mate
>> parts with mate outside 'gr' and (3) mates or mate parts that
>> don't meet
>> other mate-pairing criteria.
>>
>> > galp <- readGAlignmentPairs(fl, param=param)
>> > length(galp)
>> [1] 223
>>
>> Off hand I can't think of a good way to count aligned pairs
>> w/in a
>> specified region.
>>
>>
>> What I should have said here is I can't think of a good way to
>> accurately count aligned pairs (1 count per pair) in a BAM file w/o
>> using the pairing algo.
>>
>> Valerie
>>
>>
>>
>> Also another thing I notice is that when I use the
>> readGAlignmentPairs function, I lose my ability to cancel
>> the command
>> (ctrl-c no longer works).
>>
>>
>> It's implemented in C/C++. You can try ctrl+\ in the session
>> or kill
>> with the top command from another terminal. Maybe others have
>> suggestions they'll add ...
>>
>> Valerie
>>
>> Not sure this is a bug, or if it is something
>>
>> to do with the internal workings of the function. The loss
>> of ctrl-c
>> also occurs when trying to run as a Rscript. I can seemingly
>> ctrl-c
>> anytime before, but it hits this step it just refuses to
>> cancel.
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain
>> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>
>> wrote:
>>
>> Hi,
>>
>> You don't need to call findMateAlignment() when using
>> readGAlignmentPairs(). In a previous iteration,
>> readGAlignmentPairs() called findMateAlignment() but
>> this is no
>> longer the case.
>>
>> readGAlignmentPairs() does have more overhead than
>> readGAlignments()
>> because of the mate paring. These numbers are for a BAM
>> file with
>> 800484 records (400054 final pairs) on my not-so-speedy
>> laptop.
>>
>> library(microbenchmark)
>> library(RNAseqData.HNRNPC.bam.____chr14)
>> fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES
>>
>> microbenchmark(____readGAlignments(fls[1]),
>> times=5)
>>
>> Unit: seconds
>> expr min lq median
>> uq
>> max neval
>> readGAlignments(fls[1]) 1.753653 1.788292 1.85527
>> 2.019071
>> 2.121211 5
>>
>>
>>
>> ## mate pairing
>>
>>
>> microbenchmark(____readGAlignmentPairs(fls[1]), times=5)
>>
>> Unit: seconds
>> expr min lq
>> median
>> uq max neval
>> readGAlignmentPairs(fls[1]) 9.283966 9.300328
>> 9.535194
>> 9.843282 11.37441 5
>>
>>
>> ## mate pairing
>>
>>
>> microbenchmark(____readGAlignmentsList(fls[1]), times=5)
>>
>> Unit: seconds
>> expr min lq
>> median
>> uq max neval
>> readGAlignmentsList(fls[1]) 8.409743 8.457232
>> 8.803696
>> 9.692845 9.867628 5
>>
>>
>>
>> You can set a 'yieldSize' when creating the BamFile.
>> This will chunk
>> through the file 'yieldSize' records at a time and is
>> useful when
>> processing a complete file and when there are memory
>> limitations.
>>
>> bf <- BamFile(myfile, yieldSize = 1000000)
>>
>> Herve reported some additional timings for
>> readGAlignmentPairs() on
>> this post:
>>
>>
>> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/059695.html
>>
>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html>
>>
>>
>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>>
>> <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>>
>>
>>
>> Can you provide some numbers for how long your run is
>> taking and the
>> number of records you're trying to read in?
>>
>> I assume 'allExons' is used for counting against the
>> BAM files in
>> later code? Maybe this is why your were looking into
>> summarizeOverlaps(). I should point out that if
>> you're only
>> interested in the regions overlapping 'allExons' you
>> could call
>> range(allExons) and use that as the 'which' in the
>> param. This would
>> focus the area of interest and speed up the reading.
>>
>> Valerie
>>
>>
>>
>>
>> On 06/30/14 12:22, Fong Chun Chan wrote:
>>
>> Hi,
>>
>> I've been using the GenomicFeatures R package to
>> read in some
>> RNA-seq
>> paired-end read data. While the readGAlignments()
>> function reads
>> in the bam
>> file within a minute, I've noticed that the
>> readGAlignmentPairs() function
>> is extremely slow.
>>
>> This is even after restricting the space to just a
>> single
>> chromosome along
>> with setting the flags isDuplicate = FALSE and
>> isNotPrimaryRead
>> = FALSE
>> (the latter being suggested in the reference):
>>
>> "An easy way to avoid this degradation is to load
>> only primary
>> alignments
>> by setting the isNotPrimaryRead flag to FALSE in
>> ScanBamParam()"
>>
>> Based on my understanding, it seems that the ï¬
>> ndMateAlignment() function
>>
>> has to be run first. Is there any other way to
>> speed it up?
>> Perhaps I am
>> doing something wrong here?
>>
>> Code and sessionInfo() below.
>>
>> Thanks!
>>
>> ----
>> library("GenomicFeatures")
>> library('GenomicAlignments')
>> library('Rsamtools')
>> si <- seqinfo(BamFile(arguments$____bamFile))
>> gr <- GRanges(arguments$chr, IRanges(100,
>> seqlengths(si)[[arguments$chr]____]-100))
>> allExons <- exons(txdb, columns = c('gene_id',
>> 'exon_id',
>> 'exon_name'))
>> scf <- scanBamFlag( isDuplicate = FALSE,
>> isNotPrimaryRead =
>> FALSE ) #
>> remove duplicate reads, remove non-primary reads
>> (i.e.
>> multi-aligned reads)
>> reads <- readGAlignmentPairs( arguments$bamFile,
>> param =
>> ScanBamParam(
>> which = gr, flag = scf ) ); # grab reads in
>> specific region
>>
>> sessionInfo()
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> LC_PAPER=en_US.UTF-8
>> LC_NAME=C LC_ADDRESS=C
>> LC_TELEPHONE=C
>> LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils
>> datasets
>> methods
>> base
>>
>> other attached packages:
>> [1] argparse_1.0.1 proto_0.3-10
>> GenomicAlignments_1.0.1 BSgenome_1.32.0
>> Rsamtools_1.16.1
>> Biostrings_2.32.0 XVector_0.4.0
>> GenomicFeatures_1.16.2
>> AnnotationDbi_1.26.0 Biobase_2.24.0
>> GenomicRanges_1.16.3
>> GenomeInfoDb_1.0.2 IRanges_1.22.9
>> BiocGenerics_0.10.0
>> vimcom.plus_0.9-93
>> [16] setwidth_1.0-3 colorout_1.0-2
>>
>> loaded via a namespace (and not attached):
>> [1] BatchJobs_1.2 BBmisc_1.7
>> BiocParallel_0.6.1
>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6
>> checkmate_1.0
>> codetools_0.2-8 DBI_0.2-7
>> digest_0.6.4
>> fail_1.2
>> findpython_1.0.1 foreach_1.4.2
>> getopt_1.20.0
>> iterators_1.0.7
>> plyr_1.8.1 Rcpp_0.11.2
>> RCurl_1.95-4.1
>> [19] rjson_0.2.14 RSQLite_0.11.4
>> rtracklayer_1.24.2
>> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2
>> tools_3.1.0
>> XML_3.98-1.1 zlibbioc_1.10.0
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> ___________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:Bioconductor at r-project.org>
>> <mailto:Bioconductor at r-__project.org
>> <mailto:Bioconductor at r-project.org>>
>> https://stat.ethz.ch/mailman/____listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>>
>> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>> Search the archives:
>>
>>
>> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>>
>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>>
>>
>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>>
>>
>>
>> --
>> Valerie Obenchain
>> Program in Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, Seattle, WA 98109
>>
>> Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>> <tel:%28206%29%20667-3158>
>>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list