[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files
Valerie Obenchain
vobencha at fhcrc.org
Thu Mar 27 16:30:02 CET 2014
Hi Mike,
This is fixed in Rsamtools 1.15.35.
The bug was related to when the mate-pairing was performed wrt meeting
the 'yieldSize' requirement. Thanks for sending the file and
reproducible example.
The file has ~115 million records:
fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>> countBam(fl)$records
> [1] 114943975
To process the complete file with a yield size of 1e6 took ~ 18 GIG and
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)
bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)
>> system.time(SO(bf))
> user system elapsed
> 1545.684 12.412 1558.498
Thanks for reporting the bug.
Valerie
On 03/21/14 13:55, Michael Love wrote:
> hi Valerie,
>
> Thanks. I'm trying now to make use of the new mate pairing algorithm but
> keeping running out of memory (i requested 50 Gb) and getting my job
> killed. I wonder if you could try this code/example below?
>
> If the new C code is faster for paired-end than the
> pre-sorting/obeyQname method that would be great (eliminates the need to
> have the extra qname-sorted Bam), but it seems to me that with the old
> method, it was easier to specify hard limits on memory. Maybe I am
> missing something though. :)
>
> Here's an RNA-Seq sample from Encode, and then I run samtools index locally.
>
> from:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
>
> I download the hg19 paired end reads:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam
>
> library("GenomicFeatures")
> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> grl <- exonsBy(txdb, by="gene")
> library("Rsamtools")
> bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
> bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
> library("GenomicAlignments")
> system.time({so <- summarizeOverlaps(grl, bf,
> ignore.strand=TRUE,
> singleEnd=FALSE)})
>
>
>
> > sessionInfo()
> R Under development (unstable) (2014-03-18 r65213)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
> [2] GenomicFeatures_1.15.11
> [3] AnnotationDbi_1.25.15
> [4] Biobase_2.23.6
> [5] GenomicAlignments_0.99.32
> [6] BSgenome_1.31.12
> [7] Rsamtools_1.15.33
> [8] Biostrings_2.31.14
> [9] XVector_0.3.7
> [10] GenomicRanges_1.15.39
> [11] GenomeInfoDb_0.99.19
> [12] IRanges_1.21.34
> [13] BiocGenerics_0.9.3
> [14] Defaults_1.1-1
> [15] devtools_1.4.1
> [16] knitr_1.5
> [17] BiocInstaller_1.13.3
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.5.17
> [4] biomaRt_2.19.3 bitops_1.0-6 brew_1.0-6
> [7] codetools_0.2-8 DBI_0.2-7 digest_0.6.4
> [10] evaluate_0.5.1 fail_1.2 foreach_1.4.1
> [13] formatR_0.10 httr_0.2 iterators_1.0.6
> [16] memoise_0.1 plyr_1.8.1 Rcpp_0.11.1
> [19] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.23.18
> [22] sendmailR_1.1-2 stats4_3.2.0 stringr_0.6.2
> [25] tools_3.2.0 whisker_0.3-2 XML_3.98-1.1
> [28] zlibbioc_1.9.0
> >
>
>
>
>
>
> On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> On 03/19/14 10:24, Michael Love wrote:
>
> hi Valerie,
>
> If the Bam is not sorted by name, isn't it possible that
> readGAlignment*
> will load > yieldSize number of reads in order to find the mate?
>
>
> Sorry, our emails keep criss-crossing.
>
> Because the mate-pairing is now done in C yieldSize is no longer a
> constraint.
>
> When yieldSize is given, say 100, then 100 mates are returned. The
> algo moves through the file until 100 records are successfully
> paired. These 100 are yielded to the user and the code picks up
> where it left off. A related situation is the 'which' in a param. In
> this case you want the mates in a particular range. The algo moves
> through the range and pairs what it can. If it's left with unmated
> records it goes outside the range looking for them.
> readGAlignmentsList() will return all records in this range, mated
> or not. The metadata column of 'mate_status' indicates the different
> groupings.
>
> Valerie
>
>
>
> Mike
>
>
> On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
> Hi Mike,
>
> You no longer need to sort Bam files to use the pairing algo or
> yieldSize. The readGAlignment* functions now work with both
> constraints out of the box.
>
> Create a BamFile with yieldSize and indicate you want mates.
> bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>
> Maybe set some specifications in a param:
> param <- ScanBamParam(what = c("qname", "flag"))
>
> Then call either readGAlignment* method that handles pairs:
> readGAlignmentsList(bf, param=param)
> readGAlignmentPairs(bf, param=param)
>
> For summarizeOverlaps():
> summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)
>
> We've considered removing the 'obeyQname' arg and
> documentation but
> thought the concept may be useful in another application. I'll
> revisit the summarizeOverlaps() documentation to make sure
> 'obeyQname' is downplayed and 'asMates' is encouraged.
>
> Valerie
>
>
>
>
> On 03/19/14 07:39, Michael Love wrote:
>
> hi,
>
> From last year, in order to use yieldSize with
> paired-end
> BAMs, I
>
> should sort the BAMs by qname and then use the
> following call to
> BamFile:
>
> library(pasillaBamSubset)
> fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
> bf <- BamFile(fl, index=character(0), yieldSize=3,
> obeyQname=TRUE)
>
> https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html
> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>
>
> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html
> <https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>
>
> If I want to use
> GenomicAlignments::____readGAlignmentsList with
>
> asMates=TRUE and respecting the yieldSize, what is the
> proper
> construction? (in the end, I want to use
> summarizeOverlaps on
> paired-end BAMs while respecting the yieldSize)
>
> library(pasillaBamSubset)
> fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
> bf <- BamFile(fl, index=character(0), yieldSize=3,
> obeyQname=TRUE, asMates=TRUE)
> x <- readGAlignmentsList(bf)
> Warning message:
> In scanBam(bamfile, ..., param = param) :
> 'obeyQname=TRUE' ignored when 'asMates=TRUE'
> Calls: readGAlignmentsList ... .matesFromBam ->
> .load_bamcols_from_bamfile -> scanBam -> scanBam
>
> I see in the man pages for summarizeOverlaps it has:
>
> "In Bioconductor > 2.12 it is not
> necessary to sort paired-end BAM files by ‘qname’. When
> counting with ‘summarizeOverlaps’, setting
> ‘singleEnd=FALSE’
> will trigger paired-end reading and counting."
>
> but I don't see how this can respect the specified
> yieldSize,
> because
> readGAlignmentsList has to read in as many reads as
> necessary to
> find
> the mate.
>
> Sorry in advance if I am missing something in the
> documentation!
>
> Mike
>
> ___________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>
> mailing list
> https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>
More information about the Bioc-devel
mailing list