[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files
Hervé Pagès
hpages at fhcrc.org
Thu Mar 27 19:36:48 CET 2014
Hi Val,
On 03/27/2014 09:13 AM, Valerie Obenchain wrote:
> I should also mention that when both 'yieldSize' in the BamFile and
> 'which' in ScanBamParam are set the 'which' gets priority. The point of
> 'yieldSize' is to provide a reasonable chunk for processing the file.
> When 'which' is provided it's assumed that range is of reasonable chunk
> size so the yield is ignored.
Note that more than 1 range can be specified thru 'which'.
What about emitting a warning when 'yieldSize' is ignored?
>
> I've added this info to the 'summarizeOverlaps' and 'readGAlignments'
> man pages in GenomicAlignments.
Is this a property of scanBam()? If so then maybe it should be
documented in the man page for scanBam(). I just had a quick look
and was not able to find it, sorry if I missed it.
summarizeOverlaps(), readGAlignments(), and probably other tools,
just rely on scanBam().
Thanks,
H.
>
> Valerie
>
> On 03/27/14 08:30, Valerie Obenchain wrote:
>> Hi Mike,
>>
>> This is fixed in Rsamtools 1.15.35.
>>
>> The bug was related to when the mate-pairing was performed wrt meeting
>> the 'yieldSize' requirement. Thanks for sending the file and
>> reproducible example.
>>
>> The file has ~115 million records:
>>
>> fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>>> countBam(fl)$records
>>> [1] 114943975
>>
>> To process the complete file with a yield size of 1e6 took ~ 18 GIG and
>> 25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)
>>
>> bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
>> grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
>> SO <- function(x)
>> summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)
>>
>>>> system.time(SO(bf))
>>> user system elapsed
>>> 1545.684 12.412 1558.498
>>
>> Thanks for reporting the bug.
>>
>> Valerie
>>
>>
>>
>> On 03/21/14 13:55, Michael Love wrote:
>>> hi Valerie,
>>>
>>> Thanks. I'm trying now to make use of the new mate pairing algorithm but
>>> keeping running out of memory (i requested 50 Gb) and getting my job
>>> killed. I wonder if you could try this code/example below?
>>>
>>> If the new C code is faster for paired-end than the
>>> pre-sorting/obeyQname method that would be great (eliminates the need to
>>> have the extra qname-sorted Bam), but it seems to me that with the old
>>> method, it was easier to specify hard limits on memory. Maybe I am
>>> missing something though. :)
>>>
>>> Here's an RNA-Seq sample from Encode, and then I run samtools index
>>> locally.
>>>
>>> from:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
>>>
>>>
>>>
>>> I download the hg19 paired end reads:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam
>>>
>>>
>>>
>>> library("GenomicFeatures")
>>> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> grl <- exonsBy(txdb, by="gene")
>>> library("Rsamtools")
>>> bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>> bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
>>> library("GenomicAlignments")
>>> system.time({so <- summarizeOverlaps(grl, bf,
>>> ignore.strand=TRUE,
>>> singleEnd=FALSE)})
>>>
>>>
>>>
>>> > sessionInfo()
>>> R Under development (unstable) (2014-03-18 r65213)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
>>> [2] GenomicFeatures_1.15.11
>>> [3] AnnotationDbi_1.25.15
>>> [4] Biobase_2.23.6
>>> [5] GenomicAlignments_0.99.32
>>> [6] BSgenome_1.31.12
>>> [7] Rsamtools_1.15.33
>>> [8] Biostrings_2.31.14
>>> [9] XVector_0.3.7
>>> [10] GenomicRanges_1.15.39
>>> [11] GenomeInfoDb_0.99.19
>>> [12] IRanges_1.21.34
>>> [13] BiocGenerics_0.9.3
>>> [14] Defaults_1.1-1
>>> [15] devtools_1.4.1
>>> [16] knitr_1.5
>>> [17] BiocInstaller_1.13.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BatchJobs_1.2 BBmisc_1.5 BiocParallel_0.5.17
>>> [4] biomaRt_2.19.3 bitops_1.0-6 brew_1.0-6
>>> [7] codetools_0.2-8 DBI_0.2-7 digest_0.6.4
>>> [10] evaluate_0.5.1 fail_1.2 foreach_1.4.1
>>> [13] formatR_0.10 httr_0.2 iterators_1.0.6
>>> [16] memoise_0.1 plyr_1.8.1 Rcpp_0.11.1
>>> [19] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.23.18
>>> [22] sendmailR_1.1-2 stats4_3.2.0 stringr_0.6.2
>>> [25] tools_3.2.0 whisker_0.3-2 XML_3.98-1.1
>>> [28] zlibbioc_1.9.0
>>> >
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <vobencha at fhcrc.org
>>> <mailto:vobencha at fhcrc.org>> wrote:
>>>
>>> On 03/19/14 10:24, Michael Love wrote:
>>>
>>> hi Valerie,
>>>
>>> If the Bam is not sorted by name, isn't it possible that
>>> readGAlignment*
>>> will load > yieldSize number of reads in order to find the mate?
>>>
>>>
>>> Sorry, our emails keep criss-crossing.
>>>
>>> Because the mate-pairing is now done in C yieldSize is no longer a
>>> constraint.
>>>
>>> When yieldSize is given, say 100, then 100 mates are returned. The
>>> algo moves through the file until 100 records are successfully
>>> paired. These 100 are yielded to the user and the code picks up
>>> where it left off. A related situation is the 'which' in a param. In
>>> this case you want the mates in a particular range. The algo moves
>>> through the range and pairs what it can. If it's left with unmated
>>> records it goes outside the range looking for them.
>>> readGAlignmentsList() will return all records in this range, mated
>>> or not. The metadata column of 'mate_status' indicates the different
>>> groupings.
>>>
>>> Valerie
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>> On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
>>> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>>>
>>> Hi Mike,
>>>
>>> You no longer need to sort Bam files to use the pairing
>>> algo or
>>> yieldSize. The readGAlignment* functions now work with both
>>> constraints out of the box.
>>>
>>> Create a BamFile with yieldSize and indicate you want
>>> mates.
>>> bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>>>
>>> Maybe set some specifications in a param:
>>> param <- ScanBamParam(what = c("qname", "flag"))
>>>
>>> Then call either readGAlignment* method that handles pairs:
>>> readGAlignmentsList(bf, param=param)
>>> readGAlignmentPairs(bf, param=param)
>>>
>>> For summarizeOverlaps():
>>> summarizeOverlaps(annotation, bf, param=param,
>>> singleEnd=FALSE)
>>>
>>> We've considered removing the 'obeyQname' arg and
>>> documentation but
>>> thought the concept may be useful in another application.
>>> I'll
>>> revisit the summarizeOverlaps() documentation to make sure
>>> 'obeyQname' is downplayed and 'asMates' is encouraged.
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>> On 03/19/14 07:39, Michael Love wrote:
>>>
>>> hi,
>>>
>>> From last year, in order to use yieldSize with
>>> paired-end
>>> BAMs, I
>>>
>>> should sort the BAMs by qname and then use the
>>> following call to
>>> BamFile:
>>>
>>> library(pasillaBamSubset)
>>> fl <- sortBam(untreated3_chr4(), tempfile(),
>>> byQname=TRUE)
>>> bf <- BamFile(fl, index=character(0), yieldSize=3,
>>> obeyQname=TRUE)
>>>
>>>
>>> https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html
>>>
>>>
>>>
>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>
>>>
>>>
>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html
>>>
>>> <https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>
>>>
>>> If I want to use
>>> GenomicAlignments::____readGAlignmentsList with
>>>
>>> asMates=TRUE and respecting the yieldSize, what is the
>>> proper
>>> construction? (in the end, I want to use
>>> summarizeOverlaps on
>>> paired-end BAMs while respecting the yieldSize)
>>>
>>> library(pasillaBamSubset)
>>> fl <- sortBam(untreated3_chr4(), tempfile(),
>>> byQname=TRUE)
>>> bf <- BamFile(fl, index=character(0), yieldSize=3,
>>> obeyQname=TRUE, asMates=TRUE)
>>> x <- readGAlignmentsList(bf)
>>> Warning message:
>>> In scanBam(bamfile, ..., param = param) :
>>> 'obeyQname=TRUE' ignored when 'asMates=TRUE'
>>> Calls: readGAlignmentsList ... .matesFromBam ->
>>> .load_bamcols_from_bamfile -> scanBam -> scanBam
>>>
>>> I see in the man pages for summarizeOverlaps it has:
>>>
>>> "In Bioconductor > 2.12 it is not
>>> necessary to sort paired-end BAM files by ‘qname’. When
>>> counting with ‘summarizeOverlaps’, setting
>>> ‘singleEnd=FALSE’
>>> will trigger paired-end reading and counting."
>>>
>>> but I don't see how this can respect the specified
>>> yieldSize,
>>> because
>>> readGAlignmentsList has to read in as many reads as
>>> necessary to
>>> find
>>> the mate.
>>>
>>> Sorry in advance if I am missing something in the
>>> documentation!
>>>
>>> Mike
>>>
>>> ___________________________________________________
>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> <mailto:Bioc-devel at r-project.__org
>>> <mailto:Bioc-devel at r-project.org>>
>>> mailing list
>>> https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list